December 15, 2003, Volume 7, Number 6, ISSN 1093-5371

PDF/A: Developing a File Format for Long-Term Preservation

A committee of government, business, and academic representatives is exploring a promising approach to support long-term preservation of text-based digital documents. Originally sponsored in 2002 by the Association for Information and Image Management (AIIM) and the Association for Suppliers of Printing, Publishing and Converting Technologies (NPES), the committee prepared a draft preservation standard for Adobe’s Portable Document Format (PDF). Known as PDF/A, the potential standard intends to specify the use of PDF in a manner that is specifically geared to long-term management and use. PDF/A would be ideally suited for documents whose content and appearance must remain stable over long periods of time. A newly formed Joint Working Group (JWG) of the International Organization for Standardization (ISO) has accepted the draft as the basis for further development as an approved standard.[2]

One potential solution is to rely on text with a markup language such as the Extensible Markup Language (XML) to preserve documents. This offers some important advantages, chiefly that textual content will achieve a degree of independence from specific information technology configurations. But use of XML does not always ensure reproduction of the original visual appearance of documents. This is a particularly significant issue in situations where a document in another format is migrated to XML. Textual content may be reasonably well represented in the XML version, but much of the original document’s formatting and layout will likely be lost. If a word processing document is moved to XML, for example, instructions relating to line and page breaks, font characteristics, footnote placement, margin width, and other format-specific elements will either not migrate at all or require support for complex (and often proprietary) style conventions. Encodings used for metadata, versioning, and other identification and tracking features may also not migrate. A further problem is that XML is sharply limited in its support for nontextual data such as photographs and other graphics.

Some communities have already determined that, for reasons of authenticity and trustworthiness, it is necessary to retain both the content and physical appearance of digital documents. Other communities are interested in facilitating long-term preservation by relying on one file format throughout the digital life cycle. Nearly everyone is interested in a format that enables robust metadata.[3] For text-based objects, these interests can be summarized as a set of three basic requirements:

PDF addresses most of these requirements. It is widely integrated into many document producer work environments. Users are quite familiar with PDF from its ubiquitous presence on the World Wide Web. Some cultural heritage institutions favor PDF because it is based on a published specification; this permits independent development of non-proprietary tools for rendering documents. By publishing the specification, Adobe has managed to avoid a key preservation problem with most other commercial software: barriers (technical as well as legal) that keep users from decoding the information content contained in files. In addition, the format retains the appearance and other features of digital documents that may constitute significant properties (such as layout, formatting, and “look and feel”). The most recent PDF version also offers a rich metadata capability known as the Extensible Metadata Platform (XMP), which is based on the XML and Resource Description Framework (RDF) specifications of the World Wide Web Consortium (W3C).

Despite its advantages, unrestricted PDF is not suitable as an archival format. Adobe controls its development and is under no obligation to continue publishing the specification for future versions. The format includes some features that are incompatible with preservation purposes. PDF documents, for example, are not required to be self-contained; certain fonts may be drawn from outside the file.[4]

The work of the ISO PDF/A JWG committee is to define the set of PDF components that may be used and restrictions on their use to support long-term preservation. For example, the draft ISO PDF/A standard is distinct from PDF in that:

As currently written, the draft ISO PDF/A standard intends to specify a format for representing documents created natively in PDF, converted from other digital formats, or digitized from paper or microfilm. The standard will also support development of products that read, render, write, and validate conforming PDF objects. Sections currently are provided for file format (such as file header/trailer, string and stream objects, and other base elements that form the general file structure); graphics; fonts; annotations; actions (including treatment of hyperlinks); metadata; logical structure; and forms.

The metadata section relies on XMP, which provides for broad and flexible document characterization. From an archival perspective, XMP shows much promise for purposes of description, provenance (e.g., history of the document and its context), preservation, and administration. There are also some key technical advantages for digital preservation because XMP metadata is 1) embedded in each file as plain text, which both lessens the possibility of loss and simplifies access, and 2) structured and represented in a manner that conforms to W3C specifications. As with XML and RDF, XMP permits user-defined schemas to describe metadata properties. This offers the prospect of rich metadata that is widely interoperable and interpretable over time. Currently XMP does not provide for machine-readable schemas, which severely limits validation of metadata against applicable schemas. A major problem here is the pending status of the RDF schema specification. W3C is, however, making progress toward formal approval of the specification.

If approved as an ISO standard, PDF/A could have an important role in digital preservation. The format promises to be widely suitable for creating and distributing documents, recording evidence of transactions, searching and retrieving, and many other common uses.
This means repositories will be able to ingest and manage documents in their original format, which is important from both a cost and authenticity standpoint. PDF/A will also support migration of documents from other formats for long-term retention.

Notes

[1] The author is reporting from the perspective of a member of the original AIIM/NPES PDF/A committee and the U.S. Technical Advisory Group to the ISO PDF/A JWG and is not representing any official position of the U.S. Library of Congress.

[2] Background and other details associated with the AIIM/NPES committee and its association with ISO are available; information about formal ISO status is available.

[3] A sampling of these and other community interests is discussed in “E-Documents Need E-Preservation,” Washington Technology, 3/3/2003.

[4] For a more extensive discussion of potential preservation issues associated with unrestricted PDF, see “Archiving and Preserving PDF Files,” RLG DigiNews, 2/15/01.

Publishing Information

RLG DigiNews (ISSN 1093-5371) is a Web-based newsletter conceived by the RLG preservation community and developed to serve a broad readership around the world. It is produced by staff in the Department of Research, Cornell University Library, in consultation with RLG and is published six times a year at www.rlg.org.

Materials in RLG DigiNews are subject to copyright and other proprietary rights. Permission is hereby given to use material found here for research purposes or private study. When citing RLG DigiNews, include the article title and author referenced plus "RLG DigiNews." Any uses other than for research or private study require written permission from RLG and/or the author of the article. To receive this, and prior to using RLG DigiNews contents in any presentations or materials you share with others, please contact Jennifer Hartzell, RLG Corporate Communications.

Please send comments and questions about this or other issues to the RLG DigiNews editors.

Co-Editors: Anne R. Kenney and Nancy Y. McGovern; Associate Editor: Robin Dale (RLG); Technical Researcher: Richard Entlich; Contributor: Erica Olsen; Copy Editor: Martha Crowe; Production Coordinator: Carla DeMello; Assistant: Valerie Jacoski.

All links in this issue were confirmed accurate as of December 15, 2003.