June 15, 2003, Volume 7, Number 3
ISSN 1093-5371

 

Like Russian Dolls: Nesting Standards for Digital Preservation

Günter Waibel
Research Libraries Group

On February 14, 2003, the Library of Congress announced that Congress had approved its plan to build a national infrastructure for the collection and long-term preservation of digital content. The establishment of the initiative, called the National Digital Information Infrastructure and Preservation Program (NDIIPP), formally recognizes the importance of digital preservation at the highest level and promises guidance through project outcomes and published research in the years to come (see Editor's Interview). While the Library of Congress engages in some heavy lifting to benefit the entire community, the constituents of the community cannot afford to remain passive. With digital preservation looming large on the national agenda, understanding the terminology and standards emerging in the field becomes the ticket for following or participating in the upcoming discussions.

This article introduces three standards for digital preservation, at least two of which feature prominently in the appendix of the plan Congress just approved.[1] Understanding what these standards are, what they can and cannot do, provides a solid foothold in present and future discussions surrounding long-term retention of digital materials, as well as a leg up on implementation.

As the title suggests, the three standards nest like Russian dolls: one provides the larger framework within which the following, more granular, standard may be implemented.

  • The Open Archival Information System (OAIS) constitutes the largest Russian doll in the lineup. A standard that comes from the space data community, OAIS is a reference model specifying the responsibilities and data flow surrounding a digital archive at a conceptual level.
  • Metadata Encoding and Transmission Standard (METS), developed by the library community, provides a data structure for exchanging, displaying, and archiving digital objects. It nests within the larger framework of the OAIS as a possible mechanism for data transfer between entities inside and outside the OAIS archive.
  • Our smallest Russian doll has the rather long name NISO[2] Data Dictionary—Technical Metadata for Digital Still Images, and again the library community deserves credit for seeing this specification through its standardization process. The NISO Data Dictionary, also known as Z39.87, describes what fields are necessary in a database for preserving digital images. In its XML encoding called Metadata for Images in XML Standard (MIX), courtesy of the Library of Congress, Z39.87 finds its home in the METS context as an extension detailing a section of administrative metadata appropriately called “technical metadata.”

Although all this probably sounds confusing in bulleted shorthand, it actually makes a lot of sense when properly laid out. This article walks through the standards one by one and elaborates on their functionality and interaction. As it works its way through the standards from the most general to the very specific, it will also home in on digital images as the files to be preserved. The expansive OAIS applies to any type of media, even nondigital materials, whereas METS applies exclusively to the digital realm of images, audio, and video. The NISO Data Dictionary focuses on technical metadata for digital still images.

From a business perspective, digital preservation is a mechanism to ensure return on investment. Enormous amounts of money have been and are being spent on reformatting original materials or creating digital resources natively. If the cultural heritage community cannot sustain access to those resources or preserve them, the investment will not bear the envisioned returns. Although a basic understanding of the general problems surrounding preservation in an ever-changing technical environment has started to permeate memory institutions, practical solutions to the challenge are slow to emerge. The three standards, OAIS, METS, and Z39.87, converge as a sustainable system architecture for digital image preservation.

The space data community represents another group with enormous stakes in the long-term viability of its data. Capturing digital imagery of art or manuscripts may seem expensive, but the cost pales in comparison to that of gathering digital imagery from outer space. Under those circumstances, losing access to data is not an option. To foster a framework for preserving data gathered in space, the Consultative Committee for Space Data Systems (CCSDS) began work on an international standard in 1990. A good ten years later the OAIS was approved by the International Organization for Standardization (ISO).

The fledgling standard met with great interest from the library community. Among its first implementers were the CURL Exemplars in Digital Archives (CEDARS) project and the Networked European Deposit Library (NEDLIB); implicitly, the National Library of Australia (NLA) has also adopted the model.[3] The California Digital Library recently received an Institute of Museum and Library Services (IMLS) grant to take first steps toward a University of California-wide preservation repository implementing the OAIS.

In the standard’s own words, “[a]n OAIS is an archive, consisting of an organization of people and systems, that has accepted the responsibility to preserve information and make it available for a Designated Community.”[4] The standard formulates a framework for understanding and applying concepts in long-term preservation of digital information. It provides a common language for talking about these concepts and the organizational principles surrounding an archive. Though the OAIS pertains to both the digital and the analog realm, it has received the most attention for its applicability to digital data.

As a reference model, the OAIS in and of itself does not specify an implementation—it does not tell you which computers to buy, which software to load, or which storage medium to use. The standard does tell you, however, how an archive should be organized. In its so-called functional model, it defines the entities (or departments, if you will) in an archive, their responsibilities, and interactions. The data flows between those entities and the outside world are specified in the information model, which delineates how information gets into the archive, how it lives in the archive, and how it gets served to the public. The OAIS leaves it up to every distinct community to flesh out an implementation of the high-level guidelines. For the cultural heritage community a number of OAIS-related documents exploring the framework’s application to libraries, museums, and archives have come out of the joint OCLC-RLG Preservation Metadata Working Group.[5]


Fig. 1. The OAIS entities and information flows

As figure 1 illustrates, the OAIS stipulates that an archive (everything within the square box) interacts with a producer as well as a consumer. It takes in data from the producer through its ingest entity, and it serves out data to the consumer through its access entity. Within the archive itself, the data content submitted for preservation gets stored and maintained in the archival storage unit; data management maintains the descriptive metadata identifying the archive’s holdings.

The OAIS dubs the data flowing between the different players information packages, or IPs. The data flows sketched out in figure 1 contain the following information packages:

  • Submission information package (SIP): data flow between producer and archive (ingest)
  • Archival information package (AIP): data archived and managed within the OAIS
  • Dissemination information package (DIP): data flow between the archive (access) and the consumer

The data represented by the information packages may vary according to the specific needs at each station: an archival information package, for example, probably contains more data aimed at managing the object than its more light-weight counterpart on the access side, the dissemination information package. Furthermore, the OAIS details several categories of information comprising a complete information package, but in keeping with its role as a reference model, it stops short of suggesting specific data elements or a specific encoding for the entire bundle of information.[6]
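To make the three package types concrete, here is a minimal sketch in Python. The OAIS itself prescribes no data elements or encoding, so everything below is an assumption made for illustration: the class name, the field names, and the two functions standing in for the ingest and access entities are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class InformationPackage:
    """Hypothetical container: content files plus metadata about them."""
    kind: str                      # "SIP", "AIP", or "DIP"
    content_files: List[str]
    metadata: Dict[str, str] = field(default_factory=dict)


def ingest(sip: InformationPackage, archive_id: str) -> InformationPackage:
    """Stand-in for the ingest entity: turn a producer's SIP into an AIP."""
    managed = dict(sip.metadata)
    # Management data added by the archive (illustrative field names).
    managed["archive_id"] = archive_id
    managed["fixity_checked"] = "yes"
    return InformationPackage("AIP", list(sip.content_files), managed)


def disseminate(aip: InformationPackage, public_fields: List[str]) -> InformationPackage:
    """Stand-in for the access entity: derive a lighter DIP for the consumer."""
    public = {k: v for k, v in aip.metadata.items() if k in public_fields}
    return InformationPackage("DIP", list(aip.content_files), public)


sip = InformationPackage("SIP", ["leaf1.tif"], {"title": "Chinese album"})
aip = ingest(sip, archive_id="A-0001")
dip = disseminate(aip, public_fields=["title"])
```

The archival package ends up carrying more metadata than the dissemination package derived from it, which is the asymmetry described above.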

Any community interested in implementing the OAIS has to identify or create a file-exchange format to function as an information package. For the cultural heritage community, METS shows great potential for filling that slot. METS wraps digital surrogates with descriptive and administrative metadata into one XML document. Digital surrogates in this context could be digital image files as well as digital audio or video. At the heart of each METS object sits the structural map, which becomes a table of contents for public access. The hierarchy of the structural map allows the navigation of media files embedded in, or referenced by, the METS object. It enables browsing through the individual pages of an artist’s book as well as jumping to specific segments in a time-based program, for example, a particular section of a video clip.

These so-called digital objects encoded in METS have three main applications that conveniently align with their potential as OAIS information packages.

  • File-exchange format. Since METS pulls together data about the item plus the digital surrogate information and encodes all that data into highly portable XML markup, the standard has been used to transfer data from local systems to union systems. At RLG, for example, a contributor to Cultural Materials can send a collection as individual METS objects, and our load program will ingest the data into a DB2 database. In OAIS terms, in this instance METS functions as a submission information package.
  • Management and preservation format. The METS specifications include an extensible section on administrative metadata that allows the digital object to carry information about administrative contexts such as legal access (intellectual property rights) or the technical environment in which the surrogate files were created (technical metadata). Because of its provisions for administrative metadata, METS lends itself to function as an archival information package in the OAIS framework.
  • Delivery format. Through use of a METS viewer utility the XML markup turns into a standards-based slide show or media player for cultural heritage content. By making digital images, audio, and video navigable, METS turns a multipart object such as a Chinese album consisting of ten leaves into a browsable object (see fig. 2) or provides efficient access to a forty-five-minute oral-history interview. The structural map divides the long audio clip into distinct sections that may be played back individually without playing the entire file. In OAIS lingo, providing public access turns METS into a dissemination information package.
 
Fig. 2. A METS object represented in the context of RLG Cultural Materials—a Chinese album from the Chinese Paintings Collection, contributed by the UC Berkeley Art Museum and Pacific Film Archive.

The METS XML schema divides the standard into a core component and several extension components. The METS core supports navigation and browsing of a digital object. It consists of a header, content files, and a structural map. The METS extension components support discovery and management of the digital object. They consist of descriptive metadata and administrative metadata, which in turn split into technical, source, digital provenance, and rights metadata.

Fig. 3. A graphical representation of a METS object (Chinese album with three leaves, and two details on leaf one).

Figure 3 details the components of a METS object and one possible set of relationships among them.

  • A header describes the METS object itself. It contains information along the lines of “who created this object, when, for what purpose.” The header information aids in managing the METS file proper.
  • The descriptive metadata section contains information describing the information resource represented by the digital object. Descriptive metadata enables discovery of the resource.
  • The structural map, represented by the individual leaves and details, orders the digital files of the object into a browsable hierarchy.
  • The content file section, represented by images one through five, declares which digital files constitute the object. Files may be either embedded in the object or referenced.
  • The administrative metadata section contains information about the digital files declared in the content file section. This section subdivides into
    • technical metadata, specifying the technical characteristics of a file
    • source metadata, specifying the source of capture (e.g., direct capture or reformatted 4 x 5 transparency)
    • digital provenance metadata, specifying the changes a file has undergone since its birth
    • rights metadata, specifying the conditions of legal access
      The sections on technical metadata, source metadata, and digital provenance metadata carry the information pertinent to digital preservation.
  • Honorary mention for the sake of comprehensiveness. A behavior section, not shown in figure 3, associates executables with a METS object. For example, a METS object may rely on a certain piece of code to instantiate for viewing, and the behavior section could reference that code.
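The core components listed above can be sketched with Python's standard XML library. This is an illustration rather than a complete or validated METS document: the element and attribute names follow the METS schema as I understand it, while the XLink namespace URI, the example URLs, and the helper function names are assumptions.

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
XLINK = "http://www.w3.org/1999/xlink"   # assumed XLink namespace URI
ET.register_namespace("mets", METS)
ET.register_namespace("xlink", XLINK)


def q(tag: str) -> str:
    """Qualify a tag name with the METS namespace."""
    return f"{{{METS}}}{tag}"


def build_album(leaves):
    """Build a skeletal METS object from (label, file_url) pairs."""
    root = ET.Element(q("mets"))
    ET.SubElement(root, q("metsHdr"), CREATEDATE="2003-06-15T00:00:00")
    file_grp = ET.SubElement(ET.SubElement(root, q("fileSec")), q("fileGrp"))
    struct = ET.SubElement(root, q("structMap"), TYPE="physical")
    album = ET.SubElement(struct, q("div"), TYPE="album", LABEL="Chinese album")
    for n, (label, url) in enumerate(leaves, start=1):
        file_id = f"FILE{n}"
        # Content file section: declare the file and reference its location.
        f = ET.SubElement(file_grp, q("file"), ID=file_id)
        ET.SubElement(f, q("FLocat"), {"LOCTYPE": "URL", f"{{{XLINK}}}href": url})
        # Structural map: one division per leaf, pointing at its file.
        leaf = ET.SubElement(album, q("div"), TYPE="leaf", LABEL=label, ORDER=str(n))
        ET.SubElement(leaf, q("fptr"), FILEID=file_id)
    return root


def table_of_contents(root):
    """Walk the structural map and resolve each file pointer to a URL."""
    files = {f.get("ID"): f.find(q("FLocat")).get(f"{{{XLINK}}}href")
             for f in root.iter(q("file"))}
    toc = []
    for div in root.iter(q("div")):
        for fptr in div.findall(q("fptr")):
            toc.append((div.get("LABEL"), files[fptr.get("FILEID")]))
    return toc


album = build_album([("Leaf 1", "http://example.org/leaf1.jpg"),
                     ("Leaf 2", "http://example.org/leaf2.jpg")])
contents = table_of_contents(album)
```

A viewer utility would do something similar to `table_of_contents`: read the structural map as a browsable hierarchy and follow each pointer to the file it names.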

The METS designers leveraged the combined power of the W3C specifications for XML schema and Namespaces in XML to create a flexible standard.

  • XML schema provides a way to specify the rules for a valid XML document.[7] The schema can be used to parse an XML document instance or, to put it in less technical terms, to verify that the XML markup conforms to the standard formalized by the schema. Using XML schema to define METS opened the doors to exploiting yet another W3C specification called Namespaces.[8]
  • Namespaces empowers METS to delegate certain metadata tasks to other XML extension schemas. For example, the METS schema itself does not dictate how you describe the resource represented by the digital object—it contains no elements for descriptive metadata. However, it contains a placeholder that may be realized through tags from an external XML schema for description.

In this way, each community can plug in its own preferred descriptive elements as long as they have been formalized into a schema.[9] The visual resources community, for example, may choose to extend METS using the VRA Core, while libraries might be more inclined to stick with Metadata Object Description Schema (MODS) from the Library of Congress. Others may decide the Dublin Core (DC) satisfies their access needs. The flexibility achieved through namespaces gives METS the potential for implementation across a wide range of communities.
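As an illustration of the placeholder mechanism, the sketch below wraps a single Dublin Core element inside a METS descriptive metadata section. It assumes the METS wrapper elements `mdWrap` and `xmlData` and the Dublin Core element namespace; the identifier and title values are made up for the example.

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("mets", METS)
ET.register_namespace("dc", DC)

root = ET.Element(f"{{{METS}}}mets")
# The descriptive metadata section is the placeholder ...
dmd = ET.SubElement(root, f"{{{METS}}}dmdSec", ID="DMD1")
wrap = ET.SubElement(dmd, f"{{{METS}}}mdWrap", MDTYPE="DC")
xml_data = ET.SubElement(wrap, f"{{{METS}}}xmlData")
# ... and the tags filling it come from a different namespace.
title = ET.SubElement(xml_data, f"{{{DC}}}title")
title.text = "Chinese album"

serialized = ET.tostring(root, encoding="unicode")
```

Swapping Dublin Core for MODS or the VRA Core would change only the namespace and the elements placed inside `xmlData`; the METS wrapper stays the same.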

The same logic applies to all components of administrative metadata. Each community has the opportunity to specify what data it deems most important for the management of its information, formalize those requirements into an XML schema, and use that schema as an extension to the hub-standard METS. For an example of a project that has identified or created a comprehensive suite of METS extensions, consult the Library of Congress AV Prototyping project.

An alternative to embedding metadata for the extension components through XML Namespaces and external schemas consists in simply referencing the data from within the object. Descriptive or administrative metadata may live outside the XML markup in a database, to which the METS object can point. Even down to the level of media files, METS provides the dual option of referencing or embedding. The METS specification makes provisions for wrapping the actual bit stream of a digital file in the XML. In most cases, however, files live at online locations pointed to from within the object.
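The dual option for media files might look as follows. This sketch assumes the METS elements `FLocat` (a pointer to an online location) and `FContent` with `binData` (a base64-wrapped bit stream); the function name, the URL, and the four-byte payload are hypothetical. Referenced metadata, as opposed to referenced files, is not shown.

```python
import base64
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
XLINK = "http://www.w3.org/1999/xlink"   # assumed XLink namespace URI


def file_element(file_id, *, url=None, payload=None):
    """Return a METS file element that references a URL or embeds bytes."""
    f = ET.Element(f"{{{METS}}}file", ID=file_id)
    if url is not None:
        # Referencing: the file lives elsewhere and the object points to it.
        ET.SubElement(f, f"{{{METS}}}FLocat",
                      {"LOCTYPE": "URL", f"{{{XLINK}}}href": url})
    elif payload is not None:
        # Embedding: the bit stream travels inside the XML, base64-encoded.
        content = ET.SubElement(f, f"{{{METS}}}FContent")
        bin_data = ET.SubElement(content, f"{{{METS}}}binData")
        bin_data.text = base64.b64encode(payload).decode("ascii")
    else:
        raise ValueError("either url or payload is required")
    return f


referenced = file_element("F1", url="http://example.org/leaf1.tif")
embedded = file_element("F2", payload=b"II*\x00")
```

Embedding makes the object self-contained at the cost of size, which is one reason referencing is the more common choice.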

In the realm of technical metadata, a fledgling NISO standard takes center stage for describing the different parameters of digital image files. As the NISO Data Dictionary—Technical Metadata for Digital Still Images, or Z39.87, the standard specifies a list of metadata elements. The Library of Congress, motivated by its AV Prototyping project, created an XML schema encoding for Z39.87, called NISO Metadata for Images in XML Standard (MIX). The XML schema constitutes the smallest Russian doll in our series of nesting standards, as it may be plugged into the METS framework as an extension schema for technical metadata. The standard also proposes fields for the source and digital provenance sections of METS.

The NISO effort draws heavily on the Tagged Image File Format specifications, better known by their acronym, TIFF. As the name implies, this format uses tags to define the characteristics of a digital file.[10] Image creation applications write the necessary parameters to the tags within the TIFF file, which means that the majority of the data Z39.87 covers already exists in file headers. To complete the metadata cycle, harvester utilities have to extract the information from the image file headers and import it into digital-asset-management systems for long-term preservation. By using the image file format specification as an integral part of the Data Dictionary, the standard leverages existing metadata to achieve cost savings.
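A harvester of the kind described can be sketched in a few lines of Python. The version below is deliberately minimal and rests on several assumptions: it reads only the first image file directory, handles only single SHORT and LONG values stored inline, and looks for just four baseline TIFF tags. The function names are hypothetical, and the sample bytes are a synthetic header built for the example, not a complete image.

```python
import struct

# Baseline TIFF tag numbers for a few basic image parameters.
TAG_NAMES = {256: "ImageWidth", 257: "ImageLength",
             259: "Compression", 262: "PhotometricInterpretation"}


def harvest_tiff_tags(data: bytes) -> dict:
    """Extract selected tags from the first IFD of a TIFF byte string."""
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None:
        raise ValueError("missing TIFF byte-order mark")
    magic, ifd_offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a TIFF file")
    (count,) = struct.unpack_from(order + "H", data, ifd_offset)
    found = {}
    for i in range(count):
        entry = ifd_offset + 2 + 12 * i          # each IFD entry is 12 bytes
        tag, ftype, n = struct.unpack_from(order + "HHI", data, entry)
        if tag not in TAG_NAMES or n != 1:
            continue
        if ftype == 3:                            # SHORT, stored inline
            (value,) = struct.unpack_from(order + "H", data, entry + 8)
        elif ftype == 4:                          # LONG, stored inline
            (value,) = struct.unpack_from(order + "I", data, entry + 8)
        else:
            continue
        found[TAG_NAMES[tag]] = value
    return found


def make_sample_header() -> bytes:
    """Build a synthetic little-endian TIFF header with four tags."""
    entries = [(256, 4, 3000), (257, 4, 2000), (259, 3, 1), (262, 3, 2)]
    body = struct.pack("<H", len(entries))
    for tag, ftype, value in entries:
        fmt = "<HHIHxx" if ftype == 3 else "<HHII"
        body += struct.pack(fmt, tag, ftype, 1, value)
    body += struct.pack("<I", 0)                  # no further IFDs
    return b"II" + struct.pack("<HI", 42, 8) + body


harvested = harvest_tiff_tags(make_sample_header())
```

In the TIFF specification a Compression value of 1 means uncompressed and a PhotometricInterpretation value of 2 means RGB; a production harvester would translate such codes and cover many more tags and value types.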

On the other hand, in going beyond the TIFF specifications for some elements, the NISO standard acknowledges information outside the TIFF scope that plays an important role in digital preservation. From this vantage point, the Data Dictionary becomes an important tool for educating vendors about the metadata our community sees as invaluable to preserve our investment. RLG is investigating the formation of a group advocating among digital camera-back vendors for the cultural heritage community’s metadata needs.[11] An industry standard for consumer digital cameras called DIG35 already has broad support among vendors. DIG35 allows transfer of information from the camera to the software utility that consumers use to manage their holiday snapshots. Building on that model, NISO Z39.87 in its XML instantiation MIX could become the file-exchange format to go between high-end scanners or camera backs and sophisticated asset-management databases.

The Data Dictionary divides the technical metadata elements into four groups.

  • Basic image parameters record information crucial to displaying a viewable image. With just this information alone a programmer should be able to build a viewing application for the image from scratch. Elements represented in this section include format (GIF, JFIF/JPEG, TIFF, etc.), compression, and photometric interpretation (color space).
  • Image creation metadata records information crucial to understanding the technical environment in which a digital image file was captured. Just as in humans, any number of characteristics or issues of an image can be traced back to its birth. Elements represented in this section include SourceType (the analog source of a capture), ScanningSystem (identification of the particular scanning device used), and DateTimeCreated (date of the image’s birth).
  • Imaging performance assessment metadata records information that allows evaluation of the digital image’s quality, or output accuracy. This data aids in uncovering the characteristics of the original source of the image and functions as a benchmark for displaying or printing the file. Elements represented in this section include width and height of the digital image and the source, as well as various parameters for capturing the sampling frequency of an image and its color characteristics. Furthermore, the section hosts information about any color targets (such as GretagMacbeth or Q60) included in a capture.
  • Change history metadata records information about the processes applied to an image over its life cycle. This data tracks any changes to the original file, for example, during the course of preservation activities such as refreshing (copying the file to a new storage format) or migration (saving the file from an eclipsing file format to an emerging file format). Elements represented in this section include DateTimeProcessed, ProcessingAgency, and ProcessingSoftware.

For any institution just starting out on the path of digital preservation, managing technical metadata through the NISO Data Dictionary is a great first step. The term data dictionary itself comes from the database community; it refers to a file defining the basic organization of a database down to its individual fields and field types. NISO Z39.87 represents a blueprint for a database or a database module that can be implemented fairly quickly—all the intellectual legwork has already been done by the standards committee.
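As a rough illustration of the blueprint idea, the sketch below turns a handful of the elements named above into a small SQLite database. The table layout and column names are assumptions made for the example, and the sample values are invented; the Data Dictionary itself defines many more elements than appear here.

```python
import sqlite3

# One table for per-image elements, one for the change history,
# since an image can be processed many times over its life cycle.
SCHEMA = """
CREATE TABLE image (
    image_id                   INTEGER PRIMARY KEY,
    format                     TEXT NOT NULL,
    compression                TEXT,
    photometric_interpretation TEXT,
    source_type                TEXT,
    scanning_system            TEXT,
    date_time_created          TEXT,
    image_width                INTEGER,
    image_height               INTEGER
);
CREATE TABLE change_history (
    change_id           INTEGER PRIMARY KEY,
    image_id            INTEGER NOT NULL REFERENCES image(image_id),
    date_time_processed TEXT NOT NULL,
    processing_agency   TEXT,
    processing_software TEXT
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute(
    "INSERT INTO image (image_id, format, compression, photometric_interpretation,"
    " source_type, date_time_created, image_width, image_height)"
    " VALUES (1, 'TIFF', 'uncompressed', 'RGB', '4 x 5 transparency',"
    " '2003-06-15T10:00:00', 3000, 2000)"
)
conn.execute(
    "INSERT INTO change_history (image_id, date_time_processed,"
    " processing_agency, processing_software)"
    " VALUES (1, '2004-01-10T09:30:00', 'Example Library', 'hypothetical-migrator 1.0')"
)
rows = conn.execute(
    "SELECT i.format, c.date_time_processed"
    " FROM image i JOIN change_history c USING (image_id)"
).fetchall()
```

Keeping change history in its own table lets each refresh or migration add a row without touching the record of the original capture.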

For expanding the database to include structural metadata relating files to each other, plus a descriptive record, as well as rights metadata, the database could be augmented by looking at METS and its extension schemas. Again, the Library of Congress AV Prototyping project offers a model implementation of a database using the METS approach. Scaling up to the bigger picture, this database could find its home in an archival environment specified by the OAIS.

Fig. 4. OAIS, METS, and NISO Z39.87 as nesting standards

To summarize: as illustrated by figure 4, the OAIS stipulates information packages, which find instantiation in METS; METS stipulates an extension schema for technical metadata, which finds an instantiation in Z39.87’s XML schema, MIX. Now, after the detailed review, the first bulleted list in this article should make a lot more sense.
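The innermost nesting step can be sketched as follows: a technical metadata section in METS wrapping image metadata from another namespace. The METS wrapper elements and the `NISOIMG` metadata type are taken from the METS schema as I understand it. The inner namespace URI and element names are placeholders chosen for illustration and should not be read as the actual MIX schema.

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
MIX_PLACEHOLDER = "urn:example:mix"   # placeholder, NOT the real MIX namespace


def tech_md(tech_id, fields):
    """Wrap image technical metadata in a METS administrative section."""
    amd = ET.Element(f"{{{METS}}}amdSec")
    tech = ET.SubElement(amd, f"{{{METS}}}techMD", ID=tech_id)
    wrap = ET.SubElement(tech, f"{{{METS}}}mdWrap", MDTYPE="NISOIMG")
    data = ET.SubElement(wrap, f"{{{METS}}}xmlData")
    mix = ET.SubElement(data, f"{{{MIX_PLACEHOLDER}}}mix")
    for name, value in fields.items():
        # Illustrative element names; consult the MIX schema for the real ones.
        ET.SubElement(mix, f"{{{MIX_PLACEHOLDER}}}{name}").text = str(value)
    return amd


section = tech_md("TECH1", {"Compression": "uncompressed", "ImageWidth": 3000})
```

Placed inside a METS object that serves as an archival information package, such a section completes the chain from OAIS down to the individual image parameter.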

In broad strokes, digital preservation with the nesting standards OAIS, METS, and Z39.87 looks like a puzzle with all the pieces neatly falling into place. In the details, however, some harmonization issues between the standards remain. For example, the OAIS model breaks an information package into different subcomponents than the METS schema; the NISO Data Dictionary and its XML encoding MIX cover not only the technical metadata extension of METS, but also some elements that the digital object standard relegates to sections on source and digital provenance. Nevertheless, the convergence of three standards developed independently illustrates that a holistic view of digital preservation is emerging. Only widespread implementation will tell whether the theory as outlined by the standards can hold up in practice.

Footnotes
[1] Both OAIS and METS are referenced by multiple essays.

[2] National Information Standards Organization.

[3] For a review of CEDARS, NEDLIB, and NLA implementations of the OAIS, see “Preservation Metadata for Digital Objects: A Review of the State of the Art,” published by the OCLC/RLG Working Group on Preservation Metadata.

[4] See http://wwwclassic.ccsds.org/documents/pdf/CCSDS-650.0-B-1.pdf, p.1-1.

[5] RLG also maintains an OAIS Web site with further links.

[6] For an introduction to the OAIS that goes beyond the present article, consult “Meeting the Challenges of Digital Preservation: The OAIS Reference Model,” by Brian Lavoie.

[7] Some of you may be more familiar with document type definitions (DTDs) to specify rules for valid SGML and XML markup.

[8] One of the key differences between XML schemas and DTDs is that only schemas allow extensions through the use of XML namespaces.

[9] For more information on XML in the cultural heritage community, see “DigiCult Technology Watch Briefing 7: The XML Family of Technologies.”

[10] For the full TIFF tag library, see Appendix A of the format specifications.

[11] For more information about this fledgling initiative, please contact the author.

 

Publishing Information

RLG DigiNews (ISSN 1093-5371) is a Web-based newsletter conceived by the RLG preservation community and developed to serve a broad readership around the world. It is produced by staff in the Department of Research, Cornell University Library, in consultation with RLG and is published six times a year at www.rlg.org.

Materials in RLG DigiNews are subject to copyright and other proprietary rights. Permission is hereby given to use material found here for research purposes or private study. When citing RLG DigiNews, include the article title and author referenced plus "RLG DigiNews,

." Any uses other than for research or private study require written permission from RLG and/or the author of the article. To receive this, and prior to using RLG DigiNews contents in any presentations or materials you share with others, please contact Jennifer Hartzell (jlh@notes.rlg.org), RLG Corporate Communications.

Please send comments and questions about this or other issues to the RLG DigiNews editors.

Co-Editors: Anne R. Kenney and Nancy Y. McGovern; Associate Editor: Robin Dale (RLG); Technical Researcher: Richard Entlich; Contributor: Erica Olsen; Copy Editor: Martha Crowe; Production Coordinator: Carla DeMello; Assistant: Valerie Jacoski.

All links in this issue were confirmed accurate as of June 13, 2003.

   
 