… at least in Java.
1 – Namespace and import
XML only looks simple. As soon as namespaces are used, it immediately gets complicated. What is the difference between targetNamespace="…", xmlns="…" and xmlns:tns="…"? Can I declare several prefixes for the same namespace? Can I change the default namespace from within a document? What happens if I import a schema and rebind it to another namespace? How do I reference an element unambiguously? Ever wondered how to really create a QName correctly? Ever wondered what happens if you have a cycle in your dependencies?
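To make the prefix question concrete, here is a minimal sketch using the JDK's built-in DOM parser. The class name NamespaceDemo and the namespace URI are made up for the illustration. It suggests that two documents using different prefixes (or the default namespace) for the same URI yield equal QNames, while an element without a namespace does not.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class NamespaceDemo {
    // Hypothetical namespace URI, used only for this illustration.
    static final String NS = "http://example.com/orders";

    // Parses the XML with namespace awareness and returns the root element's QName.
    static QName rootName(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // namespace awareness is off by default
        Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Element root = doc.getDocumentElement();
        String prefix = root.getPrefix() == null ? "" : root.getPrefix();
        return new QName(root.getNamespaceURI(), root.getLocalName(), prefix);
    }

    public static void main(String[] args) throws Exception {
        QName viaDefault = rootName("<order xmlns='" + NS + "'/>");
        QName viaPrefix = rootName("<tns:order xmlns:tns='" + NS + "'/>");
        QName noNamespace = rootName("<order/>");

        // QName equality only looks at namespace URI and local part, not the prefix.
        System.out.println(viaDefault.equals(viaPrefix));   // true
        System.out.println(viaDefault.equals(noNamespace)); // false
    }
}
```

The prefix is just a local alias; what identifies the element is the pair (namespace URI, local name).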
2 – Encoding and CDATA
XML encoding and file encoding are not the same thing, and this is a huge source of trouble. The two encodings must match, and the XML file should be read and parsed according to the encoding specified in the XML declaration. Depending on the encoding, characters are serialized in different ways, which is again a huge source of confusion. If the reader or writer of an XML document behaves incorrectly, the document can be silently corrupted and information can be lost. Editors don’t necessarily display the characters correctly, even when the document is right. Ever got a ? or ¿ in your text? Ever made a distinction between a raw &, &amp; and &#38;? Ever wondered whether a CDATA section was necessary or if using UTF-8 would be ok? Ever realized that > can be used as-is almost everywhere, whereas < and & must be escaped in both element content and attribute values?
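The mismatch between declared encoding and actual bytes can be sketched as follows. This is only an illustration with the JDK's default parser; EncodingDemo and the sample document are invented for the example. The same text is encoded once with the declared charset and once with a different one, and the parser is given raw bytes so that it relies on the XML declaration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

final class EncodingDemo {
    // Parses raw bytes, letting the parser pick the charset from the XML declaration.
    static String rootText(byte[] xmlBytes) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xmlBytes))
                .getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // \u00e9 is e-acute; the declaration claims ISO-8859-1.
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><t>caf\u00e9</t>";

        // Declaration and bytes agree: e-acute is the single byte 0xE9.
        byte[] matching = xml.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(rootText(matching).length()); // 4

        // Bytes are UTF-8 but the declaration still says ISO-8859-1: e-acute is
        // now two bytes (0xC3 0xA9), decoded as two separate Latin-1 characters.
        byte[] mismatched = xml.getBytes(StandardCharsets.UTF_8);
        System.out.println(rootText(mismatched).length()); // 5
    }
}
```

Nothing fails in the second case: the document is well-formed, it just no longer says what the author wrote. Handing the parser an InputStream rather than a Reader is what lets it honour the declaration in the first place.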
3 – Entities and DOCTYPE
This relates to #2, but not only. XML entities are a generic way to define variables and are declared in the DOCTYPE. You can define custom entities; this is rather unusual but still needs to be supported. Entities can be internal or external to your XML document, and entity resolution may differ between the two cases. Because entities are also used to escape special characters, you cannot dismiss them as an advanced feature that you won’t use. XML entities need to be handled with care and are always a source of trouble. For instance, the tag <my-tag>hello&amp;world</my-tag> can trigger 3 characters(...) events with SAX.
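The practical consequence is that a SAX handler should never assume that the text of an element arrives in one piece. A minimal sketch, assuming the JDK's default SAX parser (the class names are mine), accumulates the chunks in a buffer and counts the callbacks:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class SaxTextDemo {
    // Accumulates text and counts how many characters(...) calls delivered it.
    static final class TextHandler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();
        int chunks = 0;

        @Override
        public void characters(char[] ch, int start, int length) {
            chunks++;
            text.append(ch, start, length); // append, never overwrite
        }
    }

    static TextHandler parse(String xml) throws Exception {
        TextHandler handler = new TextHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        TextHandler handler = parse("<my-tag>hello&amp;world</my-tag>");
        System.out.println(handler.text);   // hello&world
        System.out.println(handler.chunks); // parser-dependent, the text may arrive in several pieces
    }
}
```

The number of chunks is up to the parser, so the only portable approach is to buffer until endElement.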
4 – Naming convention
Ever wondered whether it was actually better to name your tag <my-tag/>, <myTag/> or <MyTag/>? The same goes for attributes…
5 – Null, empty string and white spaces
Distinguishing null from the empty string in XML is always painful. Null would be represented by the absence of the tag or attribute, whereas the empty string would be represented by an empty tag or empty attribute. The same problem appears if you want to distinguish an empty list from no list at all. If this is not considered clearly upfront (which is frequently the case), it can be very hard to retrofit the distinction into an application.
Whitespace is another issue of its own. The way tabs, spaces, carriage returns and line feeds are processed is always confusing. There are some options to control that, but they are way too complicated for most usages. As a consequence, these special characters will sometimes be encoded as entities, sometimes embedded in CDATA and sometimes stored as-is in the XML.
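One way to keep the null/empty distinction explicit when reading a document is sketched below. The element name comment and the class NullVsEmptyDemo are hypothetical; the convention that "absent means null" is exactly the kind of decision that has to be made upfront.

```java
import java.io.StringReader;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class NullVsEmptyDemo {
    // Returns Optional.empty() when <comment> is absent (treated as null here)
    // and Optional.of("") when the element is present but has no content.
    static Optional<String> comment(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList nodes = doc.getElementsByTagName("comment");
        if (nodes.getLength() == 0) {
            return Optional.empty();
        }
        return Optional.of(nodes.item(0).getTextContent());
    }

    public static void main(String[] args) throws Exception {
        System.out.println(comment("<order/>").isPresent());                       // false
        System.out.println(comment("<order><comment/></order>").get().isEmpty());  // true
        // Whitespace is content too: this is a one-character string, not an empty one.
        System.out.println(comment("<order><comment> </comment></order>").get().length()); // 1
    }
}
```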
6 – Normalization
XML encryption and signature look fine on paper. But as soon as you dig into the spec, you realize that it’s not so easy because of the syntactic and semantic equivalence of XML documents. Is <my-tag></my-tag> the same as <my-tag/>? To solve this issue, XML normalization (canonicalization) was introduced, which defines the canonical representation of a document. Good luck understanding all the subtleties when considering remarks #1, #2, #3 and #5.
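The equivalence question can be explored without a full canonicalizer by comparing parsed trees instead of raw text. The sketch below uses DOM's isEqualNode; it is not an implementation of Canonical XML and says nothing about what a signature library would do, it only illustrates which textual differences disappear after parsing. EquivalenceDemo is a name I made up.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

final class EquivalenceDemo {
    static Element root(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
    }

    // Compares the parsed trees rather than the character sequences.
    static boolean sameTree(String a, String b) throws Exception {
        return root(a).isEqualNode(root(b));
    }

    public static void main(String[] args) throws Exception {
        // Empty-element syntax is purely lexical.
        System.out.println(sameTree("<my-tag></my-tag>", "<my-tag/>"));        // true
        // So are attribute order and quote style.
        System.out.println(sameTree("<t a='1' b='2'/>", "<t b=\"2\" a=\"1\"/>")); // true
        // Whitespace inside text content is significant.
        System.out.println(sameTree("<t>x</t>", "<t> x </t>"));                // false
    }
}
```

A byte-level signature would treat all three pairs as different, which is why a canonical form has to be agreed on before signing.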
7 – Too many APIs and implementations
Even if things have improved in this area, there are still too many APIs and implementations available. Sometimes I wish there was one unified API and one single implementation… Ever wondered how to select a specific implementation? Ever got a classloader issue due to an XML library? Ever got confused about whether StAX was really better than SAX for reading XML documents?
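A first step when chasing such issues is simply to find out which implementation JAXP actually picked. The following sketch (class name is mine) prints the concrete factory classes behind the DOM, SAX and StAX entry points; the result depends on the JDK and on whatever XML libraries sit on the classpath.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.stream.XMLInputFactory;

final class JaxpImplementationDemo {
    // Returns the concrete factory class names resolved by the JAXP lookup.
    static String[] implementations() {
        return new String[] {
            DocumentBuilderFactory.newInstance().getClass().getName(),
            SAXParserFactory.newInstance().getClass().getName(),
            XMLInputFactory.newInstance().getClass().getName(),
        };
    }

    public static void main(String[] args) {
        for (String name : implementations()) {
            System.out.println(name);
        }
    }
}
```

The lookup can be overridden, for instance with the javax.xml.parsers.DocumentBuilderFactory system property, which is also how a stray library can end up replacing the parser you thought you were using.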
8 – Implementation options
Most XML implementations have options or features to deal with the subtleties I just described. This is especially true for namespace handling. As a consequence, your code may work on one implementation but not on another. For instance, startDocument should be used to start an XML document and deal with namespaces correctly. The strictness of implementations differs, so don’t take 100% portability for granted.
9 – Pretty printing
There are so many APIs and frameworks that pretty printing is always a mess, assuming the framework supports it at all.
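With plain JAXP, the usual approach is an identity Transformer with indentation turned on. This is a sketch, not a guaranteed recipe: OutputKeys.INDENT is standard, but the exact layout is left to the implementation, and the indent-amount key below is a Xalan-style property that other processors may silently ignore.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class PrettyPrintDemo {
    static String prettyPrint(String xml) throws Exception {
        // A Transformer created without a stylesheet copies its input unchanged.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.INDENT, "yes");
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        // Implementation-specific (Xalan-style) key; namespace-qualified keys
        // that a processor does not recognise are ignored rather than rejected.
        identity.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");

        StringWriter out = new StringWriter();
        identity.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(prettyPrint("<a><b>x</b><c/></a>"));
    }
}
```

Note that pretty printing adds whitespace text nodes, which brings back the concerns from #5 and #6.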
10 – Security
XML was not designed with security in mind. Notorious problems include dangerous framework extensions, XML bombs, outbound connections to fetch remote schemas, excessive memory consumption, and many more problems documented in this excellent article from MISC. As a consequence, an XML document can easily be abused to disrupt a system.
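A common mitigation is to lock the parser down before it ever sees untrusted input. The sketch below is one possible configuration, not a complete hardening guide: FEATURE_SECURE_PROCESSING is standard JAXP, while the disallow-doctype-decl feature URI is Xerces-specific and assumed here to be supported by the parser in use (the JDK's built-in one accepts it). HardenedParserDemo is a name I made up.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class HardenedParserDemo {
    static DocumentBuilder hardenedBuilder() throws ParserConfigurationException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        // Xerces-specific feature: reject any document that carries a DOCTYPE,
        // which rules out entity bombs and external entities in one go.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        factory.setXIncludeAware(false);
        factory.setExpandEntityReferences(false);
        return factory.newDocumentBuilder();
    }

    // Returns true when the hardened parser accepts the document.
    static boolean accepts(String xml) throws Exception {
        try {
            hardenedBuilder().parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException rejected) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        String bomb = "<!DOCTYPE lolz [<!ENTITY lol \"lol\">"
                + "<!ENTITY lol2 \"&lol;&lol;&lol;\">]><lolz>&lol2;</lolz>";
        System.out.println(accepts("<lolz>ok</lolz>")); // true
        System.out.println(accepts(bomb));              // false
    }
}
```

Rejecting DOCTYPEs outright is blunt, and it breaks documents that legitimately rely on custom entities (see #3), so it is a trade-off rather than a free fix.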
11 – XPath and XSLT
XPath and XSLT belong to the XML ecosystem and suffer from the same problem as XML itself: apparent simplicity but internal complexity. I won’t speak here about everything else that surrounds XML and forms the big picture of the XML family of specifications. I will just say that I recently got an NPE in NetBeans because “/wsa:MessageID” was not ok but “/wsa:MessageID/.” was just fine. Got the point?
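I can't reproduce that NPE here, but the namespace side of such an expression is easy to get wrong on its own. A minimal sketch with the JDK's XPath API follows; the sample message is invented, and the WS-Addressing namespace URI is an assumption on my part, since the expression above only shows the wsa prefix.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class XPathNamespaceDemo {
    // Assumed namespace URI for the wsa prefix (WS-Addressing 1.0).
    static final String WSA = "http://www.w3.org/2005/08/addressing";

    static String messageId(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // without this, prefixed XPath steps match nothing
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // The prefix used in the expression is resolved here, not in the document.
        xpath.setNamespaceContext(new NamespaceContext() {
            @Override
            public String getNamespaceURI(String prefix) {
                return "wsa".equals(prefix) ? WSA : XMLConstants.NULL_NS_URI;
            }

            @Override
            public String getPrefix(String uri) {
                return WSA.equals(uri) ? "wsa" : null;
            }

            @Override
            public Iterator<String> getPrefixes(String uri) {
                String prefix = getPrefix(uri);
                return prefix == null
                        ? Collections.<String>emptyIterator()
                        : Collections.singletonList(prefix).iterator();
            }
        });
        return xpath.evaluate("/wsa:MessageID", doc);
    }

    public static void main(String[] args) throws Exception {
        // The document uses prefix "a"; the expression uses "wsa". Only the URI matters.
        String xml = "<a:MessageID xmlns:a='" + WSA + "'>urn:uuid:1234</a:MessageID>";
        System.out.println(messageId(xml)); // urn:uuid:1234
    }
}
```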