RLG DigiNews
  December 15, 2003, Volume 7, Number 6
ISSN 1093-5371

 

PDF/A: Developing a File Format for Long-Term Preservation
William G. LeFurgy
U.S. Library of Congress[1]

A committee of government, business, and academic representatives is exploring a promising approach to support long-term preservation of text-based digital documents. Originally sponsored in 2002 by the Association for Information and Image Management (AIIM) and the Association for Suppliers of Printing, Publishing and Converting Technologies (NPES), the committee prepared a draft preservation standard for Adobe’s Portable Document Format (PDF). Known as PDF/A, the potential standard intends to specify the use of PDF in a manner that is specifically geared to long-term management and use. PDF/A would be ideally suited for documents whose content and appearance must remain stable over long periods of time. A newly formed Joint Working Group (JWG) of the International Organization for Standardization (ISO) has accepted the draft as the basis for further development as an approved standard.[2]

Few file formats are currently suitable for long-term preservation. A format is often controlled as the intellectual property of a commercial entity, which typically has a vested interest in hiding the underlying code base. Competition drives frequent change in individual formats as well as the companies that control them; information technology overall is also undergoing continuous transformation. This combination of opacity and change means there is no assurance that future technology will support today’s many formats. Indeed, tomorrow’s digital landscape will surely be littered with objects that are nightmarishly difficult to preserve, access, and interpret. Addressing this problem is a critical challenge for libraries, archives, and other organizations that maintain digital content.

One potential solution is to rely on text with markup language such as the Extensible Markup Language (XML) to preserve documents. This offers some important advantages, chiefly that textual content will achieve a degree of independence from specific information technology configurations. But use of XML does not always ensure reproduction of the original visual appearance of documents. This is a particularly significant issue in situations where a document in another format is migrated to XML. Textual content may be reasonably well represented in the XML version, but much of the original document’s formatting and layout will likely be lost. If a word processing document is moved to XML, for example, instructions relating to line and page breaks, font characteristics, footnote placement, margin width, and other format-specific elements will either not migrate at all or require support for complex (and often proprietary) style conventions. Encodings used for metadata, versioning, and other identification and tracking features may also not migrate. A further problem is that XML is sharply limited in its support for nontextual data such as photographs and other graphics.

Some communities have already determined that, for reasons of authenticity and trustworthiness, it is necessary to retain both the content and physical appearance of digital documents. Other communities are interested in facilitating long-term preservation by relying on one file format throughout the digital life cycle. Almost everyone is interested in a format that enables robust metadata.[3] For text-based objects, these interests can be summarized as a set of three basic requirements:

  1. The needs of document producers. Documents must be easy to create, compatible with workflow processes, and flexible enough to include images, subdocuments, and other components.
  2. The needs of document users. Documents must be reliable, appropriately functional, and discoverable from different approaches (e.g., index terms and full text).
  3. The needs of cultural heritage institutions and others concerned with long-term document preservation. Documents must be based on transparent and stable technology and suitable for guidelines issued to producers (e.g., guidance for activities such as document creation and submission). In addition, files must support metadata for access, provenance, and preservation.

PDF addresses most of these requirements. It is widely integrated into many document producer work environments. Users are quite familiar with PDF from its ubiquitous presence on the World Wide Web. Some cultural heritage institutions favor PDF because it is based on a published specification; this permits independent development of non-proprietary tools for rendering documents. By publishing the specification, Adobe has managed to avoid a key preservation problem with most other commercial software: barriers (technical as well as legal) for users to decode information content contained in files. In addition, the format retains the appearance and other features of digital documents that may constitute significant properties (such as layout, formatting, and “look and feel”). The most recent PDF version also offers a rich metadata capability known as the Extensible Metadata Platform (XMP), which is based on the XML and Resource Description Framework (RDF) specifications of the World Wide Web Consortium (W3C).

Despite its advantages, unrestricted PDF is not suitable as an archival format. Adobe controls its development and is under no obligation to continue publishing the specification for future versions. The format includes some features that are incompatible with preservation purposes. PDF documents, for example, are not required to be self-contained; certain fonts may be drawn from outside the file.[4] The work of the ISO PDF/A JWG is to define the set of PDF components that may be used and restrictions on their use to support long-term preservation. For example, the draft ISO PDF/A standard is distinct from PDF in that:

  • Audio and video content are forbidden
  • JavaScript and executable file launches are prohibited
  • All fonts must be embedded and also must be legally embeddable for unlimited, universal rendering
  • Colorspaces must be specified in a device-independent manner
  • Encryption is forbidden
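
The restrictions above lend themselves to a simple rule check. The following is a minimal sketch in Python, not a validator: `DocumentProfile` and `draft_pdfa_violations` are hypothetical names, and the sketch assumes that some PDF parser has already reported the listed features of a file. Whether a font is legally embeddable is a licensing question that cannot be reduced to a flag here, so that part of the font rule is left out.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentProfile:
    """Hypothetical summary of a PDF's features, as a parser might report them."""
    has_audio_or_video: bool = False
    has_javascript: bool = False
    launches_executables: bool = False
    fonts_embedded: dict = field(default_factory=dict)  # font name -> is it embedded?
    device_independent_color: bool = True
    encrypted: bool = False


def draft_pdfa_violations(doc):
    """Return readable descriptions of which of the five listed rules a document breaks."""
    violations = []
    if doc.has_audio_or_video:
        violations.append("audio or video content present")
    if doc.has_javascript or doc.launches_executables:
        violations.append("JavaScript or executable launch present")
    missing = sorted(name for name, embedded in doc.fonts_embedded.items() if not embedded)
    if missing:
        violations.append("fonts not embedded: " + ", ".join(missing))
    if not doc.device_independent_color:
        violations.append("device-dependent color space")
    if doc.encrypted:
        violations.append("encryption present")
    return violations


# Example: a file with a script, one unembedded font, and encryption.
example = DocumentProfile(
    has_javascript=True,
    fonts_embedded={"Times": True, "Arial": False},
    encrypted=True,
)
for problem in draft_pdfa_violations(example):
    print(problem)
```

An empty result means only that none of these five rules is broken; the draft standard contains further requirements that this sketch does not model.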

As currently written, the draft ISO PDF/A standard intends to specify a format for representing documents created natively in PDF, converted from other digital formats, or digitized from paper or microfilm. The standard will also support development of products that read, render, write, and validate conforming PDF objects. Sections currently are provided for file format (such as file header/trailer, string and stream objects, and other base elements that form the general file structure); graphics; fonts; annotations; actions (including treatment of hyperlinks); metadata; logical structure; and forms.

The metadata section relies on XMP, which provides for broad and flexible document characterization. From an archival perspective, XMP shows much promise for purposes of description, provenance (e.g., history of the document and its context), preservation, and administration. There are also some key technical advantages for digital preservation because XMP metadata is 1) embedded in each file as plain text, which both lessens the possibility of loss and simplifies access, and 2) structured and represented in a manner that conforms to W3C specifications. As with XML and RDF, XMP permits user-defined schemas to describe metadata properties. This offers the prospect of rich metadata that is widely interoperable and interpretable over time. Currently XMP does not provide for machine-readable schemas, which severely limits validation of metadata against applicable schemas. A major problem here is the pending status of the RDF schema specification. W3C is, however, making progress toward formal approval of the specification.
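
Because an XMP packet is plain-text RDF/XML, it can in principle be located and read without a full PDF parser. The sketch below illustrates the idea in Python under stated assumptions: the packet is stored uncompressed, it is delimited by the `xpacket` processing instructions that XMP uses, and the title is recorded as a Dublin Core `dc:title` property. The function name `read_xmp_title` and the sample bytes are hypothetical and serve only as an illustration.

```python
import re
import xml.etree.ElementTree as ET

# Namespace URIs for RDF and Dublin Core, used to look up the title property.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Captures the XML between the opening and closing xpacket processing instructions.
PACKET = re.compile(rb"<\?xpacket begin=.*?\?>(.*?)<\?xpacket end=", re.DOTALL)


def read_xmp_title(file_bytes):
    """Return the first dc:title value found in an uncompressed XMP packet, or None."""
    match = PACKET.search(file_bytes)
    if match is None:
        return None
    root = ET.fromstring(match.group(1).strip())
    item = root.find(".//dc:title/rdf:Alt/rdf:li", NS)
    return item.text if item is not None else None


# Hypothetical file contents: an XMP packet surrounded by other data.
SAMPLE = (
    b"%PDF-1.4\n...other file content...\n"
    b'<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>\n'
    b'<x:xmpmeta xmlns:x="adobe:ns:meta/">\n'
    b'<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">\n'
    b'<rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">\n'
    b'<dc:title><rdf:Alt><rdf:li xml:lang="x-default">Annual Report</rdf:li></rdf:Alt></dc:title>\n'
    b"</rdf:Description>\n</rdf:RDF>\n</x:xmpmeta>\n"
    b'<?xpacket end="w"?>\n%%EOF\n'
)

print(read_xmp_title(SAMPLE))
```

A byte scan of this kind only works when the metadata is stored as readable text; a production tool would more likely rely on a PDF library to locate the metadata stream.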

If approved as an ISO standard, PDF/A could have an important role in digital preservation. The format promises to be widely suitable for creating and distributing documents, recording evidence of transactions, searching and retrieving, and many other common uses. This means repositories will be able to ingest and manage documents in their original format, which is important from both a cost and authenticity standpoint. PDF/A will also support migration of documents from other formats for long-term retention.

Notes

[1] The author is reporting from the perspective of a member of the original AIIM/NPES PDF/A committee and the U.S. Technical Advisory Group to the ISO PDF/A JWG and is not representing any official position of the U.S. Library of Congress.

[2] Background and other details associated with the AIIM/NPES committee and its association with ISO are available; information about formal ISO status is available.

[3] A sampling of these and other community interests is discussed in “E-Documents Need E-Preservation,” Washington Technology, 3/3/2003.

[4] For a more extensive discussion of potential preservation issues associated with unrestricted PDF, see “Archiving and Preserving PDF Files,” RLG DigiNews, 2/15/01.


Publishing Information

RLG DigiNews (ISSN 1093-5371) is a Web-based newsletter conceived by the RLG preservation community and developed to serve a broad readership around the world. It is produced by staff in the Department of Research, Cornell University Library, in consultation with RLG and is published six times a year at www.rlg.org.

Materials in RLG DigiNews are subject to copyright and other proprietary rights. Permission is hereby given to use material found here for research purposes or private study. When citing RLG DigiNews, include the article title and author referenced plus "RLG DigiNews." Any uses other than for research or private study require written permission from RLG and/or the author of the article. To receive this, and prior to using RLG DigiNews contents in any presentations or materials you share with others, please contact Jennifer Hartzell, RLG Corporate Communications.

Please send comments and questions about this or other issues to the RLG DigiNews editors.

Co-Editors: Anne R. Kenney and Nancy Y. McGovern; Associate Editor: Robin Dale (RLG); Technical Researcher: Richard Entlich; Contributor: Erica Olsen; Copy Editor: Martha Crowe; Production Coordinator: Carla DeMello; Assistant: Valerie Jacoski.

All links in this issue were confirmed accurate as of December 15, 2003.

 

   
 