yxml updates

2013-10-02 16:27:05 +02:00
commit 94469e44e3
parent 4e7538db99
1 changed file with 66 additions and 23 deletions
--- a/dat/yxml
+++ b/dat/yxml
@@ -1,10 +1,11 @@
 =pod

-I<*But see the L<Bugs and Limitations|/Bugs and Limitations> below.>
+I<*But see the L<Bugs and Limitations|/Bugs and Limitations> and L<Conformance Issues|/Conformance Issues> below.>

-Yxml is a small (C<6 KiB>) non-validating yet mostly conforming XML parser
-written in C.  Its primary goals are small binary size, simplicity and
-correctness. It also happens to be L<pretty fast|/Comparison>.
+Yxml is a small (C<6 KiB>) L<non-validating|/Validating vs. non-validating> yet
+mostly conforming XML parser written in C.  Its primary goals are small binary
+size, simplicity and correctness. It also happens to be L<pretty
+fast|/Comparison>.

 The code can be obtained from the L<git repo|http://g.blicky.net/yxml.git> and
 is available under a permissive MIT license. The only two files you need are
@@ -60,11 +61,6 @@ But let's not be I<too> optimistic, because there are also...

 =over

-=item * Element and Attribute names may only consist of ASCII characters.
-
-=item * Does not verify that non-ASCII characters in attribute values or
-element contents are within the allowed character ranges.
-
 =item * A conditional section in a C<< <!DOCTYPE ..> >> declaration will result
 in a parse error.

@@ -77,33 +73,51 @@ not available through the API.

 I hope to have these issues fixed in the near future.

-=head3 Non-features
-
-And now follows a list of things that are not supported and probably never will
-be. Most items on this list can be implemented on top of yxml.
+=head3 Conformance Issues

 =over

-=item * Does not verify all well-formedness constraints. In particular, does
-not verify that attribute names within the same element are unique, and does
-not verify that the contents of a C<< <!DOCTYPE ..> >> declaration follow the
-XML grammar.
+=item * Does not verify that non-ASCII characters in element names, element
+content, attribute names and attribute values are within the allowed Unicode
+character ranges.

-=item * No helper functions to deal with namespaces. Yxml will parse XML files
-with namespaces just fine, but it's up to the application to do the rest.
+=item * Does not verify that attribute names within the same element are unique.
+
+=item * Does not verify that the contents of a C<< <!DOCTYPE ..> >> declaration
+follow the XML grammar.
+
+=item * Can't parse documents in a non-ASCII-compatible encoding. You'll have
+to convert it to UTF-8 or something similar first.

 =item * No support for custom entity references, neither through the API nor
 using C<< <!ENTITY> >>.

+=back
+
+These conformance issues are the result of the byte-oriented and minimal design
+of yxml, and I do not intend to fix these directly within the library. All of
+the above-mentioned issues can be fixed on top of yxml (by the application, or
+by a wrapper) if strict conformance is required, with the exception of custom
+entity references. I have a simple idea on how to support those in the
+future, too.
+
+=head3 Non-features
+
+And now follows a list of things that are not part of the core XML
+specification and are not directly supported.  As with the conformance issues,
+these features can be implemented on top of yxml.
+
+=over
+
+=item * No helper functions to deal with namespaces. Yxml will parse XML files
+with namespaces just fine, but it's up to the application to do the rest.
+
 =item * No DTD or XML Schema validation.

 =item * No XSLT.

 =item * No XPath.

-=item * Can't parse documents in a non-ASCII-compatible encoding. You'll have
-to convert it to UTF-8 or something similar first.
-
 =item * Doesn't do your household chores.

 =back
@@ -122,7 +136,7 @@ implementation is also included as an indication of the "theoretical" minimum.
  expat    2.1.0  MIT              162 139   194 432       1.47     1.09
  libxml2  2.9.1  MIT              464 328   518 816       2.53     1.75
  mxml     2.7    LGPL2+static      32 733    75 832      12.38     7.80
-  yxml     git    MIT                6 015    31 448       1.18     0.73
+  yxml     git    MIT                5 935    31 384       1.14     0.74

 The code for these benchmarks is available in the
 L<bench/|http://g.blicky.net/yxml.git/tree/bench> directory on git. Some
@@ -164,3 +178,32 @@ with C<-Os> than with C<-O2>.
  libxml2   2.9.1  MIT             356 948   412 256       3.01     2.08
  mxml      2.7    LGPL2+static     27 725    71 704      11.70     7.44
  yxml      git    MIT               4 835    30 264       1.72     1.05
+
+
+=head2 Validating vs. non-validating
+
+TL;DR: yxml does I<not> accept garbage XML documents; it will correctly handle
+and report issues if the input does not strictly follow the XML grammar.
+
+The terms I<validating> and I<non-validating> have specific meanings within the
+context of XML. A validating parser is one that reads the doctype declaration
+(DTD) associated with a document, and validates that the contents of the
+document follow the rules described in the DTD. A DTD may also include
+instructions on how to parse the document, including the definition of custom
+entity references (C<&whatever;>) and instructions on how attribute values or
+element contents should be normalized before they are passed to the
+application.
+
+A non-validating parser is one that ignores the DTD and happily parses
+documents that do not follow the rules described in that DTD. Such parsers
+(usually) don't support custom entity references and will not normalize
+attribute values or element contents. A non-validating parser still has to
+verify that the XML document follows the XML syntax rules.
+
+It should be noted that a lot of XML documents found in the wild are not
+described with a DTD, but instead use an alternative technology such as XML
+Schema. Wikipedia L<has more
+information|https://en.wikipedia.org/wiki/XML#Schemas_and_validation> on this.
+Using a validating parser for such documents would only introduce bloat and may
+introduce L<potential security
+vulnerabilities|https://en.wikipedia.org/wiki/Billion_laughs>.