A tool is useless without something to work on. So what do we shape with our computing tools? Data, information, knowledge, opinions, art – in short: Content. Content is created, processed and transmitted, and nowadays this happens more and more often directly in some electronic format. The number of people who own devices and connect to the internet is constantly rising, and they use this connection to evolve their ways of working together.
Content is sent from one user to another and back. To do this, the content must take on some form: The data-format. This defines how content and its wrapping are to be handled, what is allowed and how each part looks within a file or stream. Anyone who wants to participate in the data exchange must use a software application that understands the data-format in question. Otherwise the content would appear like an unknown foreign language to their computer. If a data-format does not allow for the inclusion of pictures, for example, then there is no way to include pictures with it. The choice of data-format dictates the number of years for which I may access the content (backwards compatibility) and what I am able to do with it.
A single user will probably not feel any effect of her decision when saving a file in a particular data-format. When an IT-department or a public administration decides upon a data-format the impact is far greater: It will dominate their choice of software for several years, possibly decades. The more an organisation saves its precious writings, recordings or pictures electronically, the more important it becomes to secure continued access to the documents. These decisions, directly or indirectly, lead to the funding of the initial development and maintenance of data-formats, whether they be "good" or "bad" formats. The choices taken at one time naturally affect the available choices in the future: Many software producers intentionally try to influence users to use a data-format that they (the producers) control. For example, when technical schematics of vehicles, buildings or machinery are all held in a format controlled by the producer of the CAD application, that producer can in essence hold the data for ransom when it is time to renew the contracts. From the vendor’s point of view this is a strong position to be in for the next pricing negotiation. Occasionally, whole countries have managed to maneuver themselves into the losing end of this situation.
As you can see, a good data-format can only be an Open Standard. This requirement alone, however, is not enough. The data-format needs to solve a problem adequately: It should be a good fit from a functional point of view, as well as on a technical level. In order to judge this, there are a number of things to consider. The essay by Bert Bos explains the design principles of the W3C - the organisation which develops the formats of the World Wide Web. He mentions efficiency, maintainability, accessibility, extensibility, learnability, simplicity, longevity and a few more.
Two central questions here are:
The first question is self-explanatory: Whoever wants to save, transmit and search within a text would not want a format for pixel-based images – though such a format is unavoidable during the first step of scanning papers or incoming faxes.
The second question is much more interesting: Is the format as simple as possible and as complicated as necessary? It is very hard to design or choose a data-format which follows this principle of minimalism.
Firstly, there is the anti-pattern of “Design by Committee”, where several decision makers participate in each decision. Decisions about which software product to use within an organisation – especially in public ones – are also often made by large committees. Then it easily happens that too many cooks spoil the broth and add more into the standards than is actually necessary. The W3C at least is aware of this pattern. Many groups are not.
A second problem is the common use of checklists when evaluating software solutions. Typically it goes like this: Every stakeholder can add something to the list; the given wishes are often specific solution ideas and get condensed into the checklist for the procurement department; the software product promising to fulfill most of the items on the checklist wins; most of the time this means buying a single data-format which has many rarely used and unneeded features. It would be better to describe requirements with a focus on the problem (rather than the solution) to begin with. The evaluation process should award higher grades to solutions which consist of a number of simple, easily extensible and complementary data-formats which can be combined for the more complex needs.
But software vendors know their customers: The more features on a checklist are ticked off, the more valuable a software product will appear. That is because it seems to – at first glance – serve many needs. Except for the need for simple elegance. And so this is what the software and the data-format will look like: Bloated with many features, to reflect as many specific solution ideas as possible. This gives the software producer another advantage: Any competitor will have a hard time supporting the full feature list of the format, or providing a superior solution to just a few elements. The customer is forced to buy all or nothing. Why bother with another data-format when there is already one that claims to do everything?
Every additional feature or guideline complicates the description of the data-format exponentially. The disadvantages of bloated formats are enormous. The developers of software which needs to handle a data-format must understand the description in total: this includes the complete text of the specification and then all possible combinations of its elements. Having to read and understand less means the resulting software implementation will be simpler and more accurate. This leads to more software which can handle the data-format at a high level of quality. What follows is more competition, more choice and therefore more users of this format.
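To make the growth tangible, here is a rough sketch in Python. It rests on a simplifying assumption that is not part of any real specification: every feature is optional and independent, so a document either uses it or does not. The function name is made up for this illustration.

```python
def feature_combinations(n_features: int) -> int:
    """Number of distinct feature subsets a document could use.

    Assumption for illustration only: each feature is optional and
    independent of the others, so every feature doubles the count.
    """
    if n_features < 0:
        raise ValueError("n_features must not be negative")
    return 2 ** n_features


# Ten optional features already allow 1024 combinations,
# twenty allow more than a million.
small_format = feature_combinations(10)
large_format = feature_combinations(20)
```

Real formats are messier than this model, since some features exclude or require each other, but the doubling per feature shows why a complete test of a large format quickly becomes impractical.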
The more complex a data-format is, the greater the chance that it has rarely needed features. So the format and the implementation are comparable to a huge and sprawling mansion: Some rooms are very popular and well-frequented, while other places are hardly ever visited by people. Of course such a house is harder to secure. Burglars could push open a lonely forgotten window to the basement or hide tools in a cobwebbed corner during an official visit to the premises.
Experts see complexity as the biggest threat to software security. This is why many of them are critical or even hostile towards standards. 1
To get an understanding of the risks let us take a look at how a computer deals with written characters. A commonly used standard is Latin-9 (ISO/IEC 8859-15). It enables a computer to handle text in more than 20 languages - mostly western European ones. A single character encoded in Latin-9 can take one of 256 possible values. A newer standard called Unicode (ISO/IEC 10646) is supposed to encode all languages of the world. Therefore it comes with more than a million possible values per character. To make things worse, a single character can be encoded in several different ways, for example in "UTF-8" or "UCS-2". On the one hand Unicode is a blessing: Once it is implemented correctly, an application is prepared to handle hundreds of languages. On the other hand a programmer looking at the source code of a piece of software cannot work out in her head all the effects a character might have. With the 256 cases of Latin-9 she could. With Unicode this overview gets lost. A clever attacker might find combinations the developer did not think of. This happens on a regular basis. Here are two examples: 1. The IDN homograph attack plays tricks on users with similar looking Internet addresses. Cyrillic letters in Unicode are well suited to this. 2. The developers of a well known webserver fell prey to the possibilities of Unicode in URLs.
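The following Python sketch illustrates both points: the same character turns into different bytes depending on the encoding, and two letters from different scripts can look alike while being different characters. The label "example", the helper names and the script check are assumptions made for this illustration; the check is a crude heuristic based on Unicode character names, not a real defence against homograph attacks.

```python
import unicodedata

# Latin-9 spends exactly one byte per character.
euro_latin9 = "€".encode("iso-8859-15")
# Unicode needs an encoding on top; the same character yields different bytes.
euro_utf8 = "€".encode("utf-8")
euro_utf16 = "€".encode("utf-16-be")

# Two letters that usually render almost identically but are distinct characters.
LATIN_A = "a"          # U+0061 LATIN SMALL LETTER A
CYRILLIC_A = "\u0430"  # U+0430 CYRILLIC SMALL LETTER A


def scripts_in(label: str) -> set[str]:
    """Crude script guess: the first word of each letter's Unicode name."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in label
        if ch.isalpha()
    }


def looks_mixed_script(label: str) -> bool:
    """Flag labels whose letters come from more than one script."""
    return len(scripts_in(label)) > 1


genuine = "example"
# Hypothetical spoofed label: the Latin "a" is swapped for the Cyrillic one.
spoofed = "ex" + CYRILLIC_A + "mple"
```

On screen `genuine` and `spoofed` are hard to tell apart, yet they compare as unequal, and only the second one is flagged by `looks_mixed_script`. With Latin-9 this class of confusion cannot arise in the first place, because the Cyrillic letter has no encoding there.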
Unsurprisingly, more applications out there handle Latin-9 well than handle Unicode well. It is the same problem with every "fat" data-format: There are applications that do not understand the more exotic features, if only because it has become impossible to test the myriad of features. The software will advertise that it can read data-format “X” - but whether this works in practice is questionable.
Some data-formats create this problem on purpose: They come in different revisions. To be sure that software packages are compatible, the user has to specify the precise version of the data-format used. For example there are three variants (1.0, 1.1 and 1.2) of the Open Document Format (ODF). It is likely that the complexity grows with the version number. Certainly there are many cases where using the features of version 1.0 would be completely okay. But the default probably is to save files in the newest version the software supports. For PDF this problem is even more significant. Some versions or parts of PDF are not even an open standard.
Most texts that are being exchanged only need a small fraction of what common data-formats have to offer in terms of formatting, mark-up or layout. A simple file composed of Latin-9 characters has been editable for decades on every computer by means of a simple text editor or any word processor. A small subset of HTML 2 could cater for advanced needs like headlines, bullet-lists and hyperlinks. Alternatively any simple text-based markup language like those used by wikis would work for many tasks. The Wikipedia pages and web-logs ("blogs") of the world are proof that a lot of content can be expressed by simple means.
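How little is needed can be sketched in a few lines of Python. The markup below is invented for this illustration and does not follow any particular wiki: a line starting with "= " is a headline, a line starting with "* " is a bullet, "[URL text]" is a hyperlink, and everything else is a paragraph. The output uses only a handful of HTML elements.

```python
import html
import re

# Hypothetical link syntax: [http://example.org link text]
LINK = re.compile(r"\[(https?://[^\s\]]+) ([^\]]+)\]")


def inline(text: str) -> str:
    """Escape HTML special characters, then turn [URL text] into a link."""
    escaped = html.escape(text)
    return LINK.sub(r'<a href="\1">\2</a>', escaped)


def to_html(source: str) -> str:
    """Convert the tiny made-up markup into a small subset of HTML."""
    out = []
    in_list = False
    for raw in source.splitlines():
        line = raw.strip()
        is_item = line.startswith("* ")
        # Close an open bullet list as soon as a non-bullet line appears.
        if in_list and not is_item:
            out.append("</ul>")
            in_list = False
        if not line:
            continue
        if line.startswith("= "):
            out.append(f"<h1>{inline(line[2:])}</h1>")
        elif is_item:
            if not in_list:
                out.append("<ul>")
                in_list = True
            out.append(f"<li>{inline(line[2:])}</li>")
        else:
            out.append(f"<p>{inline(line)}</p>")
    if in_list:
        out.append("</ul>")
    return "\n".join(out)


page = to_html("= Title\n* one\n* [http://example.org link]\nSome text")
```

The whole "specification" of this toy format fits into the three sentences above, which is what makes it easy to implement completely and to test.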
Everyone – except vendors of proprietary software – profits from different software products competing with each other, while being secure and interoperable. The minimal principle for data-formats promotes all this. It just has one rule: Remove everything that is not absolutely necessary. Aim for your design to be simple and elegant. A good solution resembles a set of building blocks from which an infinite number of buildings can be made, just by combining a few types of elements.
Even though there may be good reasons to choose a data-format which covers several requirements we should ask ourselves each time: “Can this be done more simply?”