Why is XML better than custom internal formats? Isn’t a non-standard set of XML tags basically equivalent to a custom format when using data internally to an application? Absolutely not, and here is why.
Why the Article
I still regularly run into the idea that there isn’t really much to XML. Essentially XML is just another text format that happens to be popular and in many ways is equal or inferior to any number of other formats.
Of course this is true. As a text format, XML is no better or worse than pretty much anything else. Fortunately for all of us, this completely misses the point.
So for the sake of this discussion let’s define a trivial language. Let’s say all I want to do is define a list of names, ages, and e-mail addresses. There are only three bits of information in one record.
I could write them as a simple, comma-delimited file:
Michael Smit, 12, email@example.com
Sally Wunder, 53, firstname.lastname@example.org
Bill Nightly, 60, email@example.com
Or I could write a nice complicated XML document:
<?xml version="1.0" encoding="UTF-8"?>
<email-list>
  <entry>
    <name>Michael Smit</name>
    <age>12</age>
    <e-mail>firstname.lastname@example.org</e-mail>
  </entry>
  <entry>
    <name>Sally Wunder</name>
    <age>53</age>
    <e-mail>email@example.com</e-mail>
  </entry>
  <entry>
    <name>Bill Nightly</name>
    <age>60</age>
    <e-mail>firstname.lastname@example.org</e-mail>
  </entry>
</email-list>
The first example is easy to conceive, write, and parse while the second is more complicated, verbose, and contains additional syntax that doesn’t have anything to do with my data. So given a simple, easy to parse bit of information, why would I choose the second example?
Parsing Your Data
Characters With Special Meaning in Your Data
The first reason is that even the simplest format will run into data that can confuse the parser. Even this straightforward format has a problem if someone’s name has a ‘,’ in it. Take
Ed Bingley, Jr.
As a standard, XML rigorously defines escape sequences for all the special characters in the language. All the parsers already understand these characters and anyone who has used XML before will as well.
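A minimal sketch of this point in Python, using only the standard library. The age, e-mail address, and the second sample name are made up for illustration. A naive split on commas miscounts the fields, while `xml.etree.ElementTree` escapes and unescapes XML's own special characters with no custom code:

```python
import xml.etree.ElementTree as ET

# Hypothetical comma-delimited record whose name contains the delimiter.
line = "Ed Bingley, Jr., 47, ed@example.com"
fields = [f.strip() for f in line.split(",")]
field_count = len(fields)  # four fields, although the record has three values

# A hypothetical value containing characters that are special in XML itself.
entry = ET.Element("entry")
ET.SubElement(entry, "name").text = "Bingley & Sons <Ed>, Jr."

# The serializer escapes & and < on the way out ...
serialized = ET.tostring(entry, encoding="unicode")

# ... and the parser restores the original text on the way back in.
round_tripped = ET.fromstring(serialized).findtext("name")
```

The comma needs no escaping at all in the XML version, because the element boundaries, not the comma, separate the values.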
Expanded Character Sets in Your Data
Even if you handle the escape character issue, you still have to deal with non-ASCII characters. What about the ü? Either you have to move to a non-ASCII format (which some editors will not be able to handle and some languages do not handle well) or define escape sequences for these characters yourself.
As a standard, XML defines codes for non-ASCII characters, which future-proofs you not only for characters found in your own language but for text in character sets from other languages as well.
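As a small illustration in Python (standard library only; the name is a made-up example), serializing to plain ASCII writes the ü as a numeric character reference, and any XML parser turns it back into the original character:

```python
import xml.etree.ElementTree as ET

entry = ET.Element("name")
entry.text = "Jürgen Müller"  # hypothetical name containing ü

# Serializing to ASCII bytes writes ü as the character reference &#252;
ascii_bytes = ET.tostring(entry, encoding="us-ascii")

# Parsing the ASCII-only document restores the original text.
restored = ET.fromstring(ascii_bytes).text
```

The file on disk stays pure ASCII, so editors and tools that only understand ASCII can still open it.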
Validating Your Data
How do you verify that a document format matches the format you expect? In an ideal world your datasets are always clean, but of course that isn’t reality. This is particularly true of human input.
For example what about this record?
Michael Smit, Twelve, invalidemail, Sally Wunder, 53, e-mail@something
There are a bunch of problems here. We need to validate that:
- Each line has the right number of columns
- Each column has the right fundamental type of data (string, number, etc)
- Each column has data which is plausibly correct.
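Written by hand for the comma-delimited format, the three checks above might look roughly like the Python sketch below. The function name, the e-mail pattern, and the age bounds are all assumptions chosen for illustration, and this is exactly the kind of code that has to be written, tested, and maintained for every custom format:

```python
import re

# Deliberately loose, assumed pattern: something@something.tld
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_line(line):
    """Return a list of problems found in one comma-delimited record."""
    fields = [f.strip() for f in line.split(",")]

    # Check 1: the right number of columns.
    if len(fields) != 3:
        return [f"expected 3 columns, found {len(fields)}"]

    name, age, email = fields
    problems = []
    if not name:
        problems.append("name is empty")

    # Check 2: the right fundamental type.
    if not age.isdecimal():
        problems.append(f"age is not a number: {age!r}")
    # Check 3: plausibly correct (the bounds are an arbitrary assumption).
    elif not 0 < int(age) < 130:
        problems.append(f"age is implausible: {age}")

    if not EMAIL_RE.match(email):
        problems.append(f"e-mail looks invalid: {email!r}")
    return problems
```

For instance, `validate_line("Michael Smit, Twelve, invalidemail")` reports both the non-numeric age and the malformed e-mail address.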
As a standard, XML has two well-known and well-documented languages (XML Schema and RELAX NG) for describing what a “valid” document is in your particular format of XML file. When you write code that operates on a validated document you can save a lot of error-checking cruft because you know that the basic format of the file is correct (i.e. age IS a number).
Representing Your Data
Once you’ve correctly parsed your data accounting for escape characters (which many coders don’t), and validated that the data is of the types you expect (which often isn’t done), and translated that data from strings into whatever target format you want, you still have to package that data into some sort of structure that describes what you’ve parsed.
As a standard, XML has a number of tools, such as JAXB in the Java programming language, which can take the validation document (the schema) for an XML format and automatically generate code that reads, parses, validates, and packages your data into nice clean objects, where the names of the objects and their properties map directly to the names found in your format.
Reading Your Data in Other Programming Languages
Let’s say you’ve defined your escape characters, written your validation code, translated your data into internal structures which clearly describe the data, and unit tested and debugged that code. Now you need to read that data in another two programming languages. Prepare to start all over again.
As a standard, XML can be parsed, validated, and packaged in much the same way in almost any serious programming language.
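As one illustration, here is a sketch of reading the e-mail list in Python with nothing but the standard library. The `Entry` class and `load_entries` function are hypothetical names, and the mapping is written by hand here rather than generated from a schema:

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Entry:  # hypothetical class; its fields mirror the element names
    name: str
    age: int
    email: str


# Two entries from the sample document, without the XML declaration.
DOCUMENT = """<email-list>
  <entry>
    <name>Sally Wunder</name>
    <age>53</age>
    <e-mail>email@example.com</e-mail>
  </entry>
  <entry>
    <name>Bill Nightly</name>
    <age>60</age>
    <e-mail>firstname.lastname@example.org</e-mail>
  </entry>
</email-list>"""


def load_entries(xml_text):
    """Parse an <email-list> document into a list of Entry objects."""
    root = ET.fromstring(xml_text)
    return [
        Entry(
            name=e.findtext("name"),
            age=int(e.findtext("age")),  # the age arrives as text
            email=e.findtext("e-mail"),
        )
        for e in root.iter("entry")
    ]


entries = load_entries(DOCUMENT)
```

The parsing, unescaping, and character handling all come from the library; only the few lines that map elements to fields are specific to this format.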
Using Your Data
Searching Your Data
So now you have your data, what do you want to do with it? If you want to search the data in any but the most trivial way you’re going to have to write a fair amount of code there as well. What do you want to search for? What language are you going to use to represent the search? Are you going to allow the user to search for combinations of conditions?
Very quickly you end up writing a lot of code or providing a fairly stunted search capability.
As a standard, XML supports XPath, which can search for pretty much anything you could possibly want to find in your XML document in a well-defined language that anyone who has searched XML before will be familiar with.
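A brief sketch of two such searches in Python. Note that the standard library's `xml.etree.ElementTree` implements only a subset of XPath; numeric comparisons such as "age greater than 50" need a full XPath engine, which this example does not use:

```python
import xml.etree.ElementTree as ET

# Two entries from the sample document, written compactly.
root = ET.fromstring(
    "<email-list>"
    "<entry><name>Michael Smit</name><age>12</age>"
    "<e-mail>firstname.lastname@example.org</e-mail></entry>"
    "<entry><name>Sally Wunder</name><age>53</age>"
    "<e-mail>email@example.com</e-mail></entry>"
    "</email-list>"
)

# Every <name> element anywhere in the document.
names = [n.text for n in root.findall(".//name")]

# The e-mail of the entry whose <name> child has exactly this text.
sally_email = root.findtext("entry[name='Sally Wunder']/e-mail")
```

Each query is a single string, so it can be stored in configuration or supplied by a user without any new search code.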
Presenting Your Data
Even if you find the data you are looking for, what format do you want to present the result in? Are you sure that is the only format the customer is going to want? Are you going to have to write a new bit of code for each new format?
As a standard, XML provides powerful translation languages such as XSLT and DVSL which can be used to convert one XML format into another with a minimum of code. When you produce XML your customers can write these translations themselves because they have the tools to read/validate/package/transform your data.
Maintaining Your Data and Code
Finally, custom code introduces cost to your system. All that code for parsing, validating, storing, searching, and translating your data has to be tested, debugged, updated, documented, and managed. Every new little message format you create in your system will have the same overhead. Every new developer will have to learn how you have decided to do all of these things.
When you find a problem in one message type (e.g. it doesn’t handle special characters) and fix it, you have only fixed the problem in one instance in one place. Who knows how many other programs in your product do the same thing?
As a standard, XML and the tools that are developed for it are of high quality, do not require you to spend time and effort maintaining them, and apply to any other project using XML even if you haven’t seen that project before.
So, is XML the best possible standard? No. Is XML verbose and at times obscure? Absolutely. Will it be faster in the short term to define a comma-delimited file? If you’ve never used the XML toolset before, this is probably true.
Do these temporary shortcomings overcome the extensive benefits of using a standard? Not in my experience.