RLG
 Feature Article 3  

The Bits and Bites of Data Formats—Stainless Design for Digital Endurance

Author: Andreas Aschenbrenner - ERPANET (aschenbrenner@student.ifs.tuwien.ac.at)

Editors' Note
XML has been at the center of numerous digital preservation discussions for the past several years. This piece raises some caveats and concerns regarding the practical use of XML for preservation purposes. In future issues, we will continue to provide additional takes on XML and other potential digital preservation solutions.

When archiving digital information over the long term, we are confronted with the rapid pace of technology. Software formats vanish into obsolescence even faster than tangible elements do. In addition to their irritatingly short expiration date, the variety of formats makes reusability of information and system independence a faint hope.

Many people point to standard formats as a solution that will facilitate deciphering objects into the future. Standards may undoubtedly alleviate some problems in enabling digital preservation. Individual data formats, however, are designed for specific reasons and offer specific features. Conversion from one format to another may entail loss of some of these features. Loss of information is the most-imminent risk, though not the only one.

After discussing some of the raisons d'être of data formats from a technical perspective, this article highlights restrictions in translating between formats, specifically in the example of XML, [2] and investigates implications for digital preservation.

In pointing out the idiosyncrasies and the use of specific data formats, this text concludes with a positive supposition: data formats are not an obstruction to digital preservation in themselves. No format is more equal than the others, but some formats are more appropriate in some environments and for some requirements than others.

Design Criteria for Data Formats

Specific design goals guide the definition of a data format. Software dependency is a side effect of proprietary formats that may, as some speculate, be welcomed by some profit-oriented vendors, but there are more-objective design criteria and motivators that explain the myriad of existing formats, as suggested by the list below: [3]

  1. The main reason for defining a data format is to store information.
  2. Taking a closer look at this, the format is often a container for information at different levels. Besides the actual content, the format may also store information that controls specific functionality of a software application. [4]
  3. Another design criterion may be the size of the resulting data object. While storage space is getting cheaper by the day, the volume of digital objects grows excessively as well. At the same time, the larger the object, the longer its transmission takes via a network. So saving storage remains a consideration.
  4. Implications for the performance of a system may need to be considered. A data format may be required to facilitate efficient access to and manipulation of information. For that purpose, a format may be geared towards a specific application.
  5. In some instances, extensibility and generality of a data format may be highly desirable.
  6. Other formats may influence the design of a new data format with the aim of allowing for compatibility. Although backward or forward compatibility, for example, is unlikely to be sustained over long periods of time and a number of technology generations, it may play a role in the definition of a data format.
  7. Specific measures could be taken to ensure the integrity of information. These include incorporating design features supporting robustness against data loss and impeding unauthorized tampering of the information. Also, security or confidentiality concerns could drive design decisions.

This list of design criteria is not exhaustive. It is, however, sufficient to explain the current multiplicity of data formats. At the same time, it explains their rapid evolution, which goes hand in hand with the progressively changing requirements that are the basis of these criteria. Even standard formats will not survive for eternity.

Note that there is tension between some of the above criteria. For example, a data format that is compressed in size (3) may be slow to encode (4). Some redundancy in formats, which is unfavorable for their size (3), may be conducive to their integrity (7). There are, however, various ways to reach one's ends. Instead of requiring inherent redundancy, the robustness of a format may also be enhanced by more-explicit measures, for example, by the use of cyclic redundancy checks.

Some of the criteria are inherent in the format, while others may be external or supplementary. For example, cyclic redundancy checks could be part of the file format or stored externally in the system environment in which the object is embedded. Similarly, information about the functionality of a specific object could be stored partly in the object itself and partly externally. Likewise, compression algorithms can be used to manipulate the size of an object after its creation.
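
As a rough illustration of the two placements just described, the following Python sketch computes a CRC-32 checksum with the standard `zlib.crc32` function and either appends it to the object or keeps it in a separate record. The four-byte trailer layout, the sample data, and the function names are assumptions made for this example, not part of any particular format.

```python
import zlib


def embed_crc(payload: bytes) -> bytes:
    """Return the payload with its CRC-32 appended as a 4-byte big-endian trailer."""
    checksum = zlib.crc32(payload)
    return payload + checksum.to_bytes(4, "big")


def verify_embedded(blob: bytes) -> bool:
    """Check a blob produced by embed_crc against its own trailer."""
    payload, trailer = blob[:-4], blob[-4:]
    return zlib.crc32(payload) == int.from_bytes(trailer, "big")


def verify_external(payload: bytes, recorded_crc: int) -> bool:
    """Check a payload against a checksum kept outside the object."""
    return zlib.crc32(payload) == recorded_crc


original = b"10,5,0,12,7,1"           # hypothetical object content
stored = embed_crc(original)           # checksum travels inside the object
external_record = zlib.crc32(original) # checksum kept by the surrounding system

# Simulate damage by flipping one bit in the first byte of the stored object.
damaged = bytes([stored[0] ^ 0x01]) + stored[1:]

print(verify_embedded(stored), verify_embedded(damaged))
```

In both placements the check only detects damage; it does not repair it, which is where deliberate redundancy in a format would come in.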

This discussion demonstrates that a small number of standard formats can hardly embrace all requirements of individual software applications. Moreover, these requirements may change during the existence of a digital object; its required features are not necessarily the same in active use as when finally archived. Although a standard format may be too restricted for use in active applications, its features may satisfy preservation requirements.

XML is often and for good reasons promoted as a standard format for preservation purposes. However, an XML-based preservation format is inappropriate for some data types and in some circumstances, as the following section will highlight based upon a review of the above criteria.

Putting XML to the Test

The advantages of XML have been exhaustively discussed [5] and are undeniable. In particular, the features of XML that foster human readability and system independence are invaluable. Some loose ends remain, however.

A word on syntax, structure, and semantics

XML defines a surface syntax for structured documents—their notation and basic structural rules. For constraining the structure of XML documents the World Wide Web Consortium (W3C) developed the language XML Schema [6]. An XML Schema can be associated with an XML document to specify which structural elements may occur at what point. As such, the Schema definition may be compared to the actual definition of a data format with an XML syntax. In other words, XML on its own is insufficient to serve as a complete data format.
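
The distinction between well-formed XML and a defined format can be sketched in Python. The standard-library parser `xml.etree.ElementTree` accepts any well-formed document and does not validate against XML Schema, so the structural rule below is a hand-written stand-in for a schema. The element names and the rule itself are assumptions chosen for this example.

```python
import xml.etree.ElementTree as ET

# Both documents are well-formed XML, so a generic parser accepts them.
doc_a = "<image><pixel><x>1</x><y>1</y></pixel></image>"
doc_b = "<image><dot><col>1</col></dot></image>"

# Hypothetical structural rule: which children each element must contain.
REQUIRED_CHILDREN = {"pixel": {"x", "y"}}


def follows_rule(xml_text: str) -> bool:
    """Return True if the root is <image> and holds only complete <pixel> elements."""
    root = ET.fromstring(xml_text)
    if root.tag != "image" or len(root) == 0:
        return False
    for child in root:
        required = REQUIRED_CHILDREN.get(child.tag)
        if required is None:
            return False  # element not permitted by the rule
        if not required <= {grandchild.tag for grandchild in child}:
            return False  # a required child is missing
    return True


for name, doc in (("doc_a", doc_a), ("doc_b", doc_b)):
    ET.fromstring(doc)  # parsing succeeds for both documents
    print(name, "follows the structural rule:", follows_rule(doc))
```

The parser alone cannot tell the two documents apart in any useful way; only the additional rule, which plays the role of the Schema, identifies one of them as an instance of the intended format.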

Given the need to impose structure on XML, there is the risk of a variety of XML Schemas being defined, each for a slightly different use. That is happening at the moment; current initiatives tend to embark on developing Schemas on their own and only for their own use. The flood of XML Schemas, however, encumbers digital preservation as well as reusability and interoperability just as the myriad of other data formats does.

Moreover, both languages in tandem, XML and XML Schema, are insufficient for expressing the semantics—i.e. the meaning—of a digital object. Other languages from the toolkit of the W3C Semantic Web Activity are further building blocks for facilitating machine understanding and automation on a basic semantic level. But eventually exhaustive documentation of a data format is necessary to make it human-understandable and to preserve its meaning.

Considerations for size and performance

In the design of an XML Schema, the above criteria need to be considered as for any other data format. The application of XML alone does not ensure a desirable and reliable format. This is even more true for the criteria of size and performance. The XML syntax makes it difficult, if not impossible, to define formats that are small in size and/or facilitate performance.

Difficulties in converting formats to XML

Perhaps most important, a number of data types cannot reasonably be translated to XML. Take, for example, an image format. It is, of course, possible to mark up an image in XML:

<image><pixel><position><x>1</x> <y>1</y></position><color>
<red>10</red><green>5</green>
<blue>0</blue></color></pixel> <pixel>... </image>

In another data format, the same might be expressed as

10,5,0,...

A more-appropriate solution would be to select one of the widely available standard formats for images such as TIFF, PNG, or JPEG2000. But different image formats serve different purposes, too.
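
To put rough numbers on the comparison between the two encodings above, the following Python sketch writes the same synthetic image both ways and measures the results. The 4x4 image, the row-major ordering of the comma-separated values, and the function names are assumptions made for illustration.

```python
def pixels_to_xml(pixels):
    """Mark up (x, y, r, g, b) tuples in the verbose element style shown above."""
    parts = ["<image>"]
    for x, y, r, g, b in pixels:
        parts.append(
            f"<pixel><position><x>{x}</x><y>{y}</y></position>"
            f"<color><red>{r}</red><green>{g}</green><blue>{b}</blue></color></pixel>"
        )
    parts.append("</image>")
    return "".join(parts)


def pixels_to_csv(pixels):
    """Write only the color values; position is implied by the order of the values."""
    return ",".join(f"{r},{g},{b}" for _, _, r, g, b in pixels)


# A synthetic 4x4 image in which every pixel has the color (10, 5, 0).
pixels = [(x, y, 10, 5, 0) for y in range(1, 5) for x in range(1, 5)]

xml_size = len(pixels_to_xml(pixels).encode("utf-8"))
csv_size = len(pixels_to_csv(pixels).encode("utf-8"))
print(f"XML: {xml_size} bytes, comma-separated: {csv_size} bytes")
```

The comma-separated form is smaller partly because it leaves things unsaid: the image width and the order of the color channels would have to be recorded elsewhere, for example in the format's documentation.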

There are other kinds of data that have not been considered for translation or simply cannot be translated into XML, including audio, video, and 3D simulation models. Similarly, large repositories of scientific data may deliberately choose not to adhere to XML.

Considering human readability

When reviewing XML's acclaimed feature of human readability, we find that it is not inherent in the XML format. Of course, XML facilitates human readability. However, its elements must be named such that they are also human-understandable.

To underline this argument, we take the example above. We have seen that the version of the image marked up in XML is quite a bit longer. Let us try to improve on that:

<i><p1><p2 x="1" y="1" /><c r="10" g="5" b="0" />
</p1> <p1> ... </i>

So it is possible to produce more compact XML code: this statement is about half as long as the initial XML example above. Although this is beneficial for the criterion of size, it encroaches on human readability. It is no longer obvious that <p1> stands for a pixel and <p2> for the position of this pixel. To reiterate, XML does not automatically mean human-readable. There may, in fact, be non-XML data formats that are more easily understandable, perhaps with the assistance of a brief external explanation. The importance of documentation is therefore emphasized in this context. The wide gap between human-readable and human-understandable needs to be bridged by suitable documentation.
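
The trade-off can be made concrete with a short Python sketch that compares the length of the two markups for a single pixel and then uses a small lookup table to recover the meaning of the short element names. The lookup table stands in for external documentation; it and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET

verbose = (
    "<image><pixel><position><x>1</x><y>1</y></position>"
    "<color><red>10</red><green>5</green><blue>0</blue></color></pixel></image>"
)
compact = '<i><p1><p2 x="1" y="1"/><c r="10" g="5" b="0"/></p1></i>'

# Hypothetical documentation table; without it the short names are guesswork.
TAG_DOCUMENTATION = {"i": "image", "p1": "pixel", "p2": "position", "c": "color"}


def documented_tags(xml_text):
    """List element names in document order, expanded via the documentation table."""
    root = ET.fromstring(xml_text)
    return [TAG_DOCUMENTATION.get(element.tag, element.tag) for element in root.iter()]


print(len(verbose), len(compact))
print(documented_tags(compact))
```

Both strings are equally readable to a parser. Only the lookup table, that is, the documentation, makes the compact one understandable to a person.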

Reusability and interoperability

Like human readability, reusability of digital objects and interoperability are not inherent properties of XML either. Again, it depends on whether two partners in an interaction adhere to a common XML Schema. Deciphering an XML file demands even more than an XML Schema: the meaning of marked-up information must be understandable, which goes beyond mere structural definition. As a particular manifestation of this requirement, a proprietary file format wrapped in XML remains proprietary. Generally speaking, just because a file is XML-based doesn't mean it will be open. [7]
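
The point about wrapped proprietary formats can be sketched in a few lines of Python. The byte sequence below is a made-up stand-in for an undocumented binary format, and the `document` and `payload` element names are assumptions for this example.

```python
import base64
import xml.etree.ElementTree as ET

# Stand-in for the bytes of an undocumented binary format (hypothetical content).
opaque_bytes = bytes([0x8F, 0x02, 0x00, 0x1B, 0xC4, 0x7A, 0x00, 0x10])

wrapped = (
    '<document><payload encoding="base64">'
    + base64.b64encode(opaque_bytes).decode("ascii")
    + "</payload></document>"
)

# The XML layer parses without trouble ...
payload = ET.fromstring(wrapped).find("payload")
recovered = base64.b64decode(payload.text)

# ... but what comes out is the same undocumented byte sequence as before.
print(recovered.hex())
```

The XML wrapper adds a readable envelope and nothing more; interpreting the recovered bytes still requires the specification of the inner format.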

Final thoughts

In summary, XML is not a one-size-fits-all solution. In the end, the human designer will determine if its advantageous features can be exploited. In some situations and for some data types, it may prove better to use either a standard format or possibly a dedicated data format instead of one based on XML.

Many digital objects, however, will be converted to an XML-based format for preservation. One example is text-processing documents, which represent a huge percentage of the overall mass of digital objects. Text-processing software employing XML formats, such as OpenOffice, is particularly interesting for preservation initiatives in this context. These XML formats will have to be further evaluated to determine their human-readability. These formats may prove viable for preservation purposes because of the availability of authoritative technical specifications and their capacity to conserve significant properties without including large quantities of unnecessary elements.

Moreover, the possibilities offered by the XML family of specifications are only now being explored. XML in combination with RDF and other standards being developed by the W3C offers powerful possibilities. Employed with thoughtfulness and diligence, XML-based formats, like other data formats, will be an important component of a preservation solution.

Data Compression in Digital Preservation

Compression algorithms were developed to produce small-sized files. From this perspective, the compression of a data file is simply the translation of one data format into another. Preserving compressed objects is consequently a manageable challenge. The same measures that are taken for any other data format ensure the accessibility of compressed information. Obviously, compression algorithms may become obsolete just like any other technology. The conversion from one algorithm to a new one thus needs to be completed before the old algorithm can no longer be applied.

Nonetheless, applying compression algorithms may be less risky than converting from one data format to another. As the name implies, compression algorithms can be expressed in mathematical terms. This reduces the risk of inadvertently losing information in the compression or decompression process. Moreover, the compression tool can be immediately tested to see whether it correctly implements the algorithm. With suitable technical documentation, the digital objects can be decompressed now and in the future.

Usually the inversion of the compression algorithm yields the original data. Some algorithms are lossy [8] in the sense that they reduce the quality of the original in the process of compression. This may not be acceptable when digital objects need to be preserved, and lossless compression algorithms should be chosen. [9]
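
The immediate test mentioned above can be sketched as a round trip in Python. The example uses the standard `zlib` module (DEFLATE, a lossless algorithm) purely as a convenient choice; the sample data and the idea of recording a SHA-256 digest alongside the object are assumptions for this illustration.

```python
import hashlib
import zlib

# Highly repetitive sample data (hypothetical object content).
original = ("10,5,0," * 1000).encode("ascii")

compressed = zlib.compress(original, 9)  # 9 = strongest compression level
restored = zlib.decompress(compressed)

# A lossless algorithm must return exactly the bytes that went in.
round_trip_ok = restored == original

# Recording a digest of the original lets a later decompression be checked too.
digest = hashlib.sha256(original).hexdigest()
digest_ok = hashlib.sha256(restored).hexdigest() == digest

print(len(original), len(compressed), round_trip_ok, digest_ok)
```

A lossy algorithm would fail this check by design, which is one practical way to see why it calls for careful deliberation in a preservation setting.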

In a nutshell, compression algorithms do not obstruct digital preservation if the algorithm is carefully selected, preservation methods such as migration are applied with care, and exhaustive documentation is retained.

Encryption in Digital Preservation

Considerations for managing data formats apply to the encryption of digital objects just as they do to compression. Similarly, preserving encrypted digital objects is not an unmanageable challenge provided the necessary precautions, described above, are taken. The importance of documentation and metadata cannot be emphasized enough at this point, as one missing piece in the jigsaw puzzle may prevent access to an object in the future. A central component when preserving encrypted objects is the key needed to decrypt them again. The key must be reliably preserved in a secure place.
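
The dependence on the key can be shown with a deliberately simple Python sketch. It uses a toy one-time pad (a byte-wise XOR with a random key as long as the data), not a production cipher; a real archive would rely on a vetted cryptographic library. The sample text and all names are assumptions for this illustration.

```python
import secrets


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR each data byte with the key byte at the same position."""
    if len(key) != len(data):
        raise ValueError("a one-time pad key must be as long as the data")
    return bytes(d ^ k for d, k in zip(data, key))


plaintext = b"report, final version"             # hypothetical object content
key = secrets.token_bytes(len(plaintext))        # must be preserved with care

ciphertext = xor_bytes(plaintext, key)           # what the archive stores
recovered = xor_bytes(ciphertext, key)           # possible only with the key

# Stand-in for a lost or mistaken key: every byte differs from the real one.
wrong_key = bytes(k ^ 0xFF for k in key)
garbage = xor_bytes(ciphertext, wrong_key)

print(recovered == plaintext, garbage == plaintext)
```

The stored object is intact in both cases; what decides whether it can be read is solely whether the right key survived with it.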

Conclusion

Despite the obvious advantages of standard formats for digital preservation, as well as reusability of data and interoperability of systems, they have to be applied with due consideration. Standard formats may in some situations fall short of specific requirements. In the design or choice of a data format, the above criteria need to be taken into account.

Moreover, as part of the above criteria, a preservation format must adequately preserve the intellectual content of a digital object. The elements included in an object's intellectual content may be defined in its significant properties, a term coined by the digital preservation project Cedars. Here it is important that significant properties be defined for individual objects in specific environments. Generic significant properties defined for a data format fail to address the preservation requirements of each and every organization. For instance, the formatting of a report may be integral to one organization, while, for another, the retention of a plain-text transcription suffices. Or, as another example, in some contexts the specific functionality of software that is reflected in the data format may be important, while it is considered extraneous in other environments using the same software. In these cases, the preservation formats of the two organizations will differ even though they both started from the same active data format. More than that, different processes in the same organization may raise different preservation requirements. This may indeed lead to varying preservation formats for the same active data format in the same organization.

All this calls for a more-careful selection of data formats, comprehensive documentation of them, and active management of the digital objects throughout their existence. To attain interoperability, all stakeholders have to present their requirements, and subsequently a suitable format can be chosen or designed in a cooperative effort. This may sometimes be a painful process, and it is unlikely that there is any format that satisfies all variations of preservation needs for a specific data type. Installations such as registries that are being developed for metadata [10] and for file formats [11] may offer the possibility of sticking to local variations in formats while at the same time allowing interoperability on a more global level.

On the whole, data formats can be useful tools that even support a specific preservation solution. Bear in mind, however, that a screwdriver should not be used to drive in a nail; a careful selection of tools is paramount. [12]

Notes

[1] The author is the Dutch content editor for ERPANET.

[2] As will become clear in the following, the XML language alone is not a data format, but, together with other members of the XML family, it is a tool for defining one.

[3] This article assumes that stakeholders are interested in interoperability, promote openness, and strive to work cooperatively towards preserving their information for future generations.

[4] Take an instruction in a text-processing format, for instance, that prompts the document to open at a specific size: the content of the text does not change if this information is missing. Or take an image format that has the capability to store image data in different layers: the layers are not recognizable when the image is viewed. Software needs dedicated information within the data format, however, to provide such functionality.

[5] One of numerous initiatives discussing the advantages of XML in preservation is the Digital Preservation Testbed in its white paper XML and Digital Preservation (September 2002).

[6] Formerly, document type definitions (DTDs) were used. They are currently being superseded, however, by XML schemas (.xsd). Refer to the W3C Web site for more information.

[7] The IDA (Interchange of Data between Administrations) Open Source Migration Guidelines. Guidelines funded by the European Commission. (October 2003), p. 24.

[8] While lossy compression may be possible in areas other than image compression, it is not sensible for text compression, for instance. For more-detailed information, refer to Wikipedia.

[9] Even if a loss of quality may appear acceptable in the present, future use may demand the original quality. For preserving digital objects, lossy compression should therefore be applied only after careful deliberation.

[10] Michael Day. Integrating Metadata Schema Registries with Digital Preservation Systems to Support Interoperability: A Proposal. In: Proceedings of the 2003 Dublin Core Conference: Supporting Communities of Discourse and Practice: Metadata Research and Applications, Seattle, Wa., USA, 28 September-2 October 2003.

[11] Stephen L. Abrams and David Seaman. Towards a Global Digital Format Registry. World Library and Information Congress: 69th IFLA General Conference and Council, August 1-9, 2003, Berlin, Germany.

[12] Other technical format specifications, including protocols and interface definitions, are similar tools, and the issues in this text apply to them analogously.


Copyright 2004 RLG.