RLG DigiNews
  June 15, 2003, Volume 7, Number 3
ISSN 1093-5371


Table of Contents


Editor's Interview
National Digital Information Infrastructure and Preservation Program: An Interview with Laura Campbell

Feature Article 2
Digital Preservation Like Russian Dolls: Nesting Standards for Digital Preservation, by Günter Waibel

Feature Article 3
Saving Digital Heritage—A UNESCO Campaign, by Colin Webb

Highlighted Web Site
Digital Dog

FAQ
Squeezing More Life Out of Bitonal Files: A Study of Black and White. Part III, by Richard Entlich

Calendar of Events

Announcements


Editor's Interview

National Digital Information Infrastructure and Preservation Program

Laura Campbell
Associate Librarian for Strategic Initiatives
Library of Congress

Editors’ Note
In January Congress approved the Library of Congress’s Plan for the National Digital Information Infrastructure and Preservation Program (NDIIPP), which will enable the Library to launch the initial phase of building a national infrastructure for the collection and long-term preservation of digital content. With this approval Congress also released $35 million for the next phase of NDIIPP, of which $15 million will be matched dollar-for-dollar from nonfederal sources. Following is an interview with Laura Campbell, the Associate Librarian for Strategic Initiatives, who is directing the work of this next phase of NDIIPP. Queries may be addressed to her special assistant, George Coulbourne.

It’s wonderful that Congress has authorized and financially supported the next phase of NDIIPP, but how will the funds be committed? What percentage of funding will be spent on research, planning, implementation, evaluation, and other core areas?

The majority of the funds will be used for testing various models that support the capture and preservation of content. The projects will focus on the preservation of a variety of digital media: e-books and e-journals, digital film, audio, and television. We will be working with other repositories as well as rights holders to test approaches that support a distributed digital preservation infrastructure for collecting and preserving content. This infrastructure will consist of a network of committed partners with defined roles and responsibilities working through a preservation architecture.

Other projects will test and help define the digital preservation architecture spelled out in the NDIIPP report. Approximately ten percent of the funds will support basic digital preservation research to help build solutions that are flexible and sustainable for the long term.

How will proposals be solicited and accepted?

Through our Web site we anticipate making calls for proposals in late summer.

What outcomes do you expect from this phase, and how will you measure success in meeting your goals and objectives?

Outcomes expected from this phase include establishing the groundwork for what we call the "digital preservation infrastructure," which has two components.

  • The first is the "digital preservation network," which will comprise a group of partners committed to collecting and preserving digital information.
  • The second component is the "digital preservation architecture," or the technology that will support long-term preservation in a distributed environment. This phase will conclude with an advanced design for the architecture.

Copyright and the intellectual property issues associated with digital information will also be a focus of this phase. We will work closely with the U.S. Copyright Office, which is part of the Library of Congress, and many stakeholders in the broader community to address issues that advance or impede preservation of content.

Communication is a key component of NDIIPP. It is critical to convey information about the program to the stakeholders in digital preservation as well as to the general public. Currently, content creators and distributors understand to varying degrees what digital preservation is, why it is needed, and what their role in preservation should be. Unlike in the analog world, where preservation decisions may be made long after the content is created, in the digital world preservation decisions often need to be made coincident with creation. Think of all the Web sites, for example, that are no longer available.

We also know from experience that the success of any new technology requires support—and understanding—from the general public. That was the case when the Library began its National Digital Library Program. A large part of the success of that public-private partnership ($15 million from Congress; more than $45 million from private donors) was the result of the public’s awareness of the importance of having remote access to the riches of the Library of Congress’s high-quality educational content. The more Library materials we made available, the more the public wanted. From such a base of support came private sector support. We believe that as the public increases its awareness of the importance of digital preservation, support for the program will grow.

The metrics to measure success will vary according to the component we are examining. For example, at a base level we can measure the success of the preservation architecture the way any design program is evaluated—by testing it. Does the architecture support long-term preservation? Is it flexible enough to change as technology changes? Can users and donors of content rely on its integrity?

The success of the preservation network must be judged in more qualitative terms. We know that we cannot capture and preserve all digital information, nor is it desirable to do so. Partners will have to make decisions on what to keep and who should keep it. In many ways this is no different than the decisions that are made every day by the selecting officials at the Library of Congress. The Library retains for its collections only about 7,000 of the approximately 20,000 items it receives each business day. Other repositories make these same decisions. The hope is that, as with analog materials, we are collecting and preserving the information that will be most useful to the U.S. Congress, researchers, and lifelong learners for generations to come. It is the generations of tomorrow who will judge the success of the decisions we make today.

As far as communication is concerned, we will know we have succeeded when there is a national conversation about the importance of digital preservation such that the public and private sectors support the goals of NDIIPP.

Who are the key stakeholders for LC in this effort, and how will you involve them? What about the National Library of Medicine and the National Agriculture Library? Research libraries? Others?

In the broadest terms, anyone who creates or uses digital information is a stakeholder in NDIIPP. NLM and NAL are key stakeholders, as are all the libraries and other repositories in this nation and around the world. We formed the National Digital Strategy Advisory Board with the idea that its members are representatives for the various stakeholder communities. The NDIIPP legislation mandates that “the overall plan should set forth a strategy for the Library of Congress, in collaboration with other Federal and non-Federal entities, to identify a national network of libraries and other organizations with responsibilities for collecting digital materials that will provide access to and maintain those materials.”

How can other institutions participate in NDIIPP?

We are interested in hearing from institutions and organizations who are collecting and preserving digital content and are interested in becoming involved in the preservation network of committed partners. They can send inquiries to http://www.digitalpreservation.gov/ndiipp/contact.html.

How will cultural repositories—large and small—benefit from NDIIPP?

We hope to set forth, in collaboration with others, a national approach to sharing the responsibility for the collection and preservation of digital content, leveraging what any one institution can do alone.

Desirable benefits of NDIIPP include

  1. shared responsibility for collection and selection development
  2. standards and best practices for managing content
  3. business models to support preservation and the shared responsibility for collection and selection development (no. 1 above)
  4. intellectual property agreements for use of rights-protected content
  5. a technical framework within which to work together

Ultimately there will be an operational environment that allows many institutions, big and small, to be part of a network that collects, preserves, and provides rights-protected access to digital content.

Digital preservation doesn’t stop at the border. Would you describe your plans for international collaboration?

With its core mission to make information available and useful and to sustain and preserve a universal collection of knowledge and creativity, regardless of format, for current and future generations of Congress and the American people, the Library of Congress has a long history as a trusted convener that is able to facilitate the development of standards and best practices in librarianship across the country and internationally.

The NDIIPP plan represents the fruits of intensive consultations with a wide range of American and international innovators, creators, and high-level managers of digital information in the private and public sectors. We achieved this through surveying national and international initiatives (Appendix 5 of the report) and during several stakeholder meetings with international participation. This was accompanied by ongoing interviews and consultation with a broad group of experts.

There is nothing abroad comparable to the congressional action taken and funding provided in behalf of digital preservation; however, areas of potential collaboration with the United States include

  • technical research
  • standards development
  • collection development
  • development of shared services needed by repositories

The Web site you have established is very helpful in conveying information on NDIIPP. How else will you keep individuals and organizations informed?

NDIIPP has already received broad coverage from the media in more than fifty publications, including the New York Times, the Washington Post, and the Chronicle of Higher Education. Their articles have been the direct result of our communications efforts. We will continue to work with major media, both general-interest and trade press, to keep NDIIPP in the public eye as it progresses in meeting its goals. We will also continue to participate in public presentations and forums, such as at the Library’s exhibit booth during the American Library Association meeting and in other appropriate venues.


Like Russian Dolls: Nesting Standards for Digital Preservation

Günter Waibel
Research Libraries Group


On February 14, 2003, the Library of Congress announced the approval of Congress for its plan to build a national infrastructure for the collection and long-term preservation of digital content. The establishment of the initiative, called National Digital Information Infrastructure and Preservation Program (NDIIPP), formally recognizes the importance of digital preservation at the highest level and promises guidance through project outcomes and published research in the years to come (see Editor's Interview). While the Library of Congress engages in some heavy lifting to benefit the entire community, the constituents of the community cannot afford to remain passive. With digital preservation looming large on the national agenda, understanding the terminology and standards emerging in the field becomes the ticket for following or participating in the upcoming discussions.

This article introduces three standards for digital preservation, at least two of which feature prominently in the appendix of the plan Congress just approved.[1] Understanding what these standards are, what they can and cannot do, provides a solid foothold in present and future discussions surrounding long-term retention of digital materials, as well as a leg up on implementation.

As the title suggests, the three standards nest like Russian dolls: one provides the larger framework within which the following, more granular, standard may be implemented.

  • The Open Archival Information System (OAIS) constitutes the largest Russian doll in the lineup. A standard that comes from the space data community, OAIS is a reference model specifying the responsibilities and data flow surrounding a digital archive at a conceptual level.
  • Metadata Encoding and Transmission Standard (METS), developed by the library community, provides a data structure for exchanging, displaying, and archiving digital objects. It nests within the larger framework of the OAIS as a possible mechanism for data transfer between entities inside and outside the OAIS archive.
  • Our smallest Russian doll has the rather long name NISO[2] Data Dictionary—Technical Metadata for Digital Still Images, and again the library community deserves credit for seeing this specification through its standardization process. The NISO Data Dictionary, also known as Z39.87, describes what fields are necessary in a database for preserving digital images. In its XML encoding called Metadata for Images in XML Standard (MIX), courtesy of the Library of Congress, Z39.87 finds its home in the METS context as an extension detailing a section of administrative metadata appropriately called “technical metadata.”

Although all this probably sounds confusing in bulleted shorthand, it actually makes a lot of sense when properly laid out. This article walks through the standards one by one and elaborates on their functionality and interaction. As it works its way through the standards from the most general to the very specific, it will also home in on digital images as the files to be preserved. The expansive OAIS applies to any type of media, even nondigital materials, whereas METS applies exclusively to the digital realm of images, audio, and video. The NISO Data Dictionary focuses on technical metadata for digital still images.

From a business perspective, digital preservation is a mechanism to ensure return on investment. Enormous amounts of money have been and are being spent on reformatting original materials or creating digital resources natively. If the cultural heritage community cannot sustain access to those resources or preserve them, the investment will not bear the envisioned returns. Although a basic understanding of the general problems surrounding preservation in an ever-changing technical environment has started to permeate memory institutions, practical solutions to the challenge are slow to emerge. The three standards, OAIS, METS, and Z39.87, converge as a sustainable system architecture for digital image preservation.

The space data community represents another group with enormous stakes in the long-term viability of its data. Capturing digital imagery of art or manuscripts may seem expensive, but the cost pales in comparison to that of gathering digital imagery from outer space. Under those circumstances, losing access to data is not an option. To foster a framework for preserving data gathered in space, the Consultative Committee for Space Data Systems (CCSDS) began work on an international standard in 1990. A good ten years later the OAIS was approved by the International Organization for Standardization (ISO).

The fledgling standard met with great interest from the library community. Among its first implementers were the CURL Exemplars in Digital Archives (CEDARS) project and the Networked European Deposit Library (NEDLIB); implicitly, the National Library of Australia (NLA) has also adopted the model.[3] The California Digital Library recently received an Institute of Museum and Library Services (IMLS) grant to take first steps toward a University of California-wide preservation repository implementing the OAIS.

In the standard’s own words, “[a]n OAIS is an archive, consisting of an organization of people and systems, that has accepted the responsibility to preserve information and make it available for a Designated Community.”[4] The standard formulates a framework for understanding and applying concepts in long-term preservation of digital information. It provides a common language for talking about these concepts and the organizational principles surrounding an archive. Though the OAIS pertains to both the digital and the analog realm, it has received the most attention for its applicability to digital data.

As a reference model, the OAIS in and of itself does not specify an implementation—it does not tell you which computers to buy, which software to load, or which storage medium to use. The standard does tell you, however, how an archive should be organized. In its so-called functional model, it defines the entities (or departments, if you will) in an archive, their responsibilities, and interactions. The data flows between those entities and the outside world are specified in the information model, which delineates how information gets into the archive, how it lives in the archive, and how it gets served to the public. The OAIS leaves it up to every distinct community to flesh out an implementation of the high-level guidelines. For the cultural heritage community a number of OAIS-related documents exploring the framework’s application to libraries, museums, and archives have come out of the joint OCLC-RLG Preservation Metadata Working Group.[5]


Fig. 1. The OAIS entities and information flows

As figure 1 illustrates, the OAIS stipulates that an archive (everything within the square box) interacts with a producer as well as a consumer. It takes in data from the producer through its ingest entity, and it serves out data to the consumer through its access entity. Within the archive itself, the data content submitted for preservation gets stored and maintained in the archival storage unit; data management maintains the descriptive metadata identifying the archive’s holdings.

The OAIS dubs the data flowing between the different players information packages, or IPs. The data flows sketched out in figure 1 contain the following information packages:

  • Submission information package (SIP): data flow between producer and archive (ingest)
  • Archival information package (AIP): data archived and managed within the OAIS
  • Dissemination information package (DIP): data flow between the archive (access) and the consumer

The data represented by the information packages may vary according to the specific needs at each station: an archival information package, for example, probably contains more data aimed at managing the object than its more light-weight counterpart on the access side, the dissemination information package. Furthermore, the OAIS details several categories of information comprising a complete information package, but in keeping with its role as a reference model, it stops short of suggesting specific data elements or a specific encoding for the entire bundle of information.[6]
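The flow of packages through an archive can be pictured with a small Python sketch. Everything in it is an illustrative assumption rather than anything the OAIS prescribes: the class names `InformationPackage` and `ToyArchive`, the `size_bytes` management field, and the choice to keep only a title as descriptive metadata are all made up for the example. The point is only that the same content changes its wrapping as it moves from producer to archive to consumer.

```python
from dataclasses import dataclass, field


@dataclass
class InformationPackage:
    """Hypothetical stand-in for an OAIS information package."""
    kind: str                      # "SIP", "AIP", or "DIP"
    content: bytes                 # the data content submitted for preservation
    metadata: dict = field(default_factory=dict)


class ToyArchive:
    """Illustrative archive with ingest, archival storage, data management, and access."""

    def __init__(self):
        self.archival_storage = {}   # identifier -> AIP
        self.data_management = {}    # identifier -> descriptive metadata

    def ingest(self, sip: InformationPackage, identifier: str) -> None:
        # Ingest turns the producer's SIP into an AIP and adds management data.
        aip_metadata = dict(sip.metadata, size_bytes=len(sip.content))
        self.archival_storage[identifier] = InformationPackage("AIP", sip.content, aip_metadata)
        self.data_management[identifier] = {"title": sip.metadata.get("title", "")}

    def access(self, identifier: str) -> InformationPackage:
        # Access derives a lighter DIP: content plus descriptive metadata only.
        aip = self.archival_storage[identifier]
        return InformationPackage("DIP", aip.content, dict(self.data_management[identifier]))


archive = ToyArchive()
archive.ingest(InformationPackage("SIP", b"image bytes", {"title": "Chinese album"}), "obj-1")
dip = archive.access("obj-1")
print(dip.kind, dip.metadata)
```

Note that the DIP handed to the consumer carries less than the AIP held in storage, which mirrors the lighter-weight access package described above.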

Any community interested in implementing the OAIS has to identify or create a file-exchange format to function as an information package. For the cultural heritage community, METS shows great potential for filling that slot. METS wraps digital surrogates with descriptive and administrative metadata into one XML document. Digital surrogates in this context could be digital image files as well as digital audio or video. At the heart of each METS object sits the structural map, which becomes a table of contents for public access. The hierarchy of the structural map allows the navigation of media files embedded in, or referenced by, the METS object. It enables browsing through the individual pages of an artist’s book as well as jumping to specific segments in a time-based program, for example, a particular section of a video clip.

These so-called digital objects encoded in METS have three main applications that conveniently align with their potential as OAIS information packages.

  • File-exchange format. Since METS pulls together data about the item plus the digital surrogate information and encodes all that data into highly portable XML markup, the standard has been used to transfer data from local systems to union systems. At RLG, for example, a contributor to Cultural Materials can send a collection as individual METS objects, and our load program will ingest the data into a DB2 database. In OAIS terms, in this instance METS functions as a submission information package.
  • Management and preservation format. The METS specifications include an extensible section on administrative metadata that allows the digital object to carry information about administrative contexts such as legal access (intellectual property rights) or the technical environment in which the surrogate files were created (technical metadata). Because of its provisions for administrative metadata, METS lends itself to function as an archival information package in the OAIS framework.
  • Delivery format. Through use of a METS viewer utility the XML markup turns into a standards-based slide show or media player for cultural heritage content. By making digital images, audio, and video navigable, METS turns a multipart object such as a Chinese album consisting of ten leaves into a browsable object (see fig. 2) or provides efficient access to a forty-five-minute oral-history interview. The structural map divides the long audio clip into distinct sections that may be played back individually without playing the entire file. In OAIS lingo, providing public access turns METS into a dissemination information package.
 
Fig. 2. A METS object represented in the context of RLG Cultural Materials—a Chinese album from the Chinese Paintings Collection, contributed by the UC Berkeley Art Museum and Pacific Film Archive.

The METS XML schema divides the standard into a core component and several extension components. The METS core supports navigation and browsing of a digital object. It consists of a header, content files, and a structural map. The METS extension components support discovery and management of the digital object. They consist of descriptive metadata and administrative metadata, which in turn split into technical, source, digital provenance, and rights metadata.

Fig. 3. A graphical representation of a METS object (Chinese album with three leaves, and two details on leaf one).

Figure 3 details the components of a METS object and one possible set of relationships among them.

  • A header describes the METS object itself. It contains information along the lines of “who created this object, when, for what purpose.” The header information aids in managing the METS file proper.
  • The descriptive metadata section contains information describing the information resource represented by the digital object. Descriptive metadata enables discovery of the resource.
  • The structural map, represented by the individual leaves and details, orders the digital files of the object into a browsable hierarchy.
  • The content file section, represented by images one through five, declares which digital files constitute the object. Files may be either embedded in the object or referenced.
  • The administrative metadata section contains information about the digital files declared in the content file section. This section subdivides into
    • technical metadata, specifying the technical characteristics of a file
    • source metadata, specifying the source of capture (e.g., direct capture or reformatted 4 x 5 transparency)
    • digital provenance metadata, specifying the changes a file has undergone since its birth
    • rights metadata, specifying the conditions of legal access
      The sections on technical metadata, source metadata, and digital provenance metadata carry the information pertinent to digital preservation.
  • Honorary mention for the sake of comprehensiveness. A behavior section, not shown in figure 3, associates executables with a METS object. For example, a METS object may rely on a certain piece of code to instantiate for viewing, and the behavior section could reference that code.
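To make the core components concrete, here is a minimal Python sketch that assembles a header, a content file section, and a structural map, then reads the structural map back as a table of contents. The element and attribute names (`metsHdr`, `fileSec`, `fileGrp`, `file`, `FLocat`, `structMap`, `div`, `fptr`) follow the METS schema; the helper names `m`, `build_mets_core`, and `table_of_contents`, the example URLs, and the fixed creation date are assumptions for illustration. The result is a skeleton only: it omits the descriptive and administrative sections and has not been validated against the schema.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def m(tag: str) -> str:
    """Qualify a tag name with the METS namespace."""
    return f"{{{METS_NS}}}{tag}"


def build_mets_core(title, leaves):
    """Build header, file section, and structural map for (label, url) pairs."""
    root = ET.Element(m("mets"), {"LABEL": title})
    ET.SubElement(root, m("metsHdr"), {"CREATEDATE": "2003-06-15T00:00:00"})
    file_grp = ET.SubElement(ET.SubElement(root, m("fileSec")), m("fileGrp"))
    top_div = ET.SubElement(ET.SubElement(root, m("structMap")), m("div"),
                            {"TYPE": "album", "LABEL": title})
    for number, (label, url) in enumerate(leaves, start=1):
        file_id = f"FILE{number}"
        # Content file section: declare the file and point to where it lives.
        file_el = ET.SubElement(file_grp, m("file"), {"ID": file_id, "MIMETYPE": "image/tiff"})
        ET.SubElement(file_el, m("FLocat"), {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": url})
        # Structural map: one division per leaf, pointing back at the file.
        leaf_div = ET.SubElement(top_div, m("div"),
                                 {"TYPE": "leaf", "LABEL": label, "ORDER": str(number)})
        ET.SubElement(leaf_div, m("fptr"), {"FILEID": file_id})
    return root


def table_of_contents(root):
    """List (label, file ID) for every division that points directly at a file."""
    return [(div.get("LABEL"), div.find(m("fptr")).get("FILEID"))
            for div in root.iter(m("div")) if div.find(m("fptr")) is not None]


album = build_mets_core("Chinese album", [
    ("Leaf 1", "http://example.org/leaf1.tif"),
    ("Leaf 2", "http://example.org/leaf2.tif"),
])
print(table_of_contents(album))
```

A viewer utility would do essentially what `table_of_contents` does, only with a richer hierarchy: walk the divisions, resolve each file pointer, and present the result for browsing.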

The METS designers leveraged the combined power of the W3C specifications for XML schema and Namespaces in XML to create a flexible standard.

  • XML schema provides a way to specify the rules for a valid XML document.[7] The schema can be used to validate an XML document instance or, to put it in less technical terms, to verify that the XML markup conforms to the standard formalized by the schema. Using XML schema to define METS opened the doors to exploiting yet another W3C specification called Namespaces.[8]
  • Namespaces empowers METS to delegate certain metadata tasks to other XML extension schemas. For example, the METS schema itself does not dictate how you describe the resource represented by the digital object—it contains no elements for descriptive metadata. However, it contains a placeholder that may be realized through tags from an external XML schema for description.

In this way, each community can plug in its own preferred descriptive elements as long as they have been formalized into a schema.[9] The visual resources community, for example, may choose to extend METS using the VRA Core, while libraries might be more inclined to stick with Metadata Object Description Schema (MODS) from the Library of Congress. Others may decide the Dublin Core (DC) satisfies their access needs. The flexibility achieved through namespaces gives METS the potential for implementation across a wide range of communities.
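As a rough illustration of the plug-in mechanism, the following Python sketch fills the METS placeholder for descriptive metadata with Dublin Core elements. It assumes the `mdWrap` and `xmlData` wrapper elements and the `MDTYPE="DC"` value from the METS schema, together with the Dublin Core element namespace; the function name `descriptive_section` and the sample title are made up for the example.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("dc", DC_NS)


def descriptive_section(section_id, fields):
    """Wrap Dublin Core elements inside a METS descriptive metadata section."""
    dmd = ET.Element(f"{{{METS_NS}}}dmdSec", {"ID": section_id})
    # mdWrap announces which external scheme the wrapped elements come from.
    wrap = ET.SubElement(dmd, f"{{{METS_NS}}}mdWrap", {"MDTYPE": "DC"})
    xml_data = ET.SubElement(wrap, f"{{{METS_NS}}}xmlData")
    for name, value in fields.items():
        # These elements live in the Dublin Core namespace, not the METS one.
        ET.SubElement(xml_data, f"{{{DC_NS}}}{name}").text = value
    return dmd


dmd = descriptive_section("DMD1", {"title": "Chinese album"})
print(ET.tostring(dmd, encoding="unicode"))
```

Swapping in MODS or the VRA Core would, under the same assumptions, mean changing the namespace and the `MDTYPE` value while the METS wrapper stays the same.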

The same logic applies to all components of administrative metadata. Each community has the opportunity to specify what data it deems most important for the management of its information, formalize those requirements into an XML schema, and use that schema as an extension to the hub-standard METS. For an example of a project that has identified or created a comprehensive suite of METS extensions, consult the Library of Congress AV Prototyping project.

An alternative to embedding metadata for the extension components through XML Namespaces and external schemas consists in simply referencing the data from within the object. Descriptive or administrative metadata may live outside the XML markup in a database, to which the METS object can point. Even down to the level of media files, METS provides the dual option of referencing or embedding. The METS specification makes provisions for wrapping the actual bit stream of a digital file in the XML. In most cases, however, files live at online locations pointed to from within the object.
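The dual option for media files can be sketched in a few lines of Python. The sketch assumes the METS elements `FLocat` (a pointer to a file stored elsewhere) and `FContent` with `binData` (a base64-encoded copy of the bit stream carried in the XML itself); the function names and sample values are hypothetical.

```python
import base64
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"


def referenced_file(file_id, url):
    """Declare a file that lives at an online location outside the object."""
    file_el = ET.Element(f"{{{METS_NS}}}file", {"ID": file_id})
    ET.SubElement(file_el, f"{{{METS_NS}}}FLocat",
                  {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": url})
    return file_el


def embedded_file(file_id, payload: bytes):
    """Declare a file whose bit stream is wrapped in the XML as base64 text."""
    file_el = ET.Element(f"{{{METS_NS}}}file", {"ID": file_id})
    content = ET.SubElement(file_el, f"{{{METS_NS}}}FContent")
    ET.SubElement(content, f"{{{METS_NS}}}binData").text = (
        base64.b64encode(payload).decode("ascii"))
    return file_el


by_reference = referenced_file("FILE1", "http://example.org/leaf1.tif")
by_value = embedded_file("FILE2", b"\x00\x01binary")
print(ET.tostring(by_reference, encoding="unicode"))
print(ET.tostring(by_value, encoding="unicode"))
```

Embedding keeps the object self-contained at the cost of size, since base64 inflates the payload by roughly a third, which is one plausible reason referencing is the common case.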

In the realm of technical metadata, a fledgling NISO standard takes center stage for describing the different parameters of digital image files. As the NISO Data Dictionary—Technical Metadata for Digital Still Images, or Z39.87, the standard specifies a list of metadata elements. The Library of Congress, motivated by its AV Prototyping project, created an XML schema encoding for Z39.87, called NISO Metadata for Images in XML Standard (MIX). The XML schema constitutes the smallest Russian doll in our series of nesting standards, as it may be plugged into the METS framework as an extension schema for technical metadata. The standard also proposes fields for the source and digital provenance sections of METS.

The NISO effort draws heavily on the Tagged Image File Format specifications, better known by their acronym TIFF. As the name implies, this format uses tags to define the characteristics of a digital file.[10] Image creation applications write the necessary parameters to the tags within the TIFF file, which means that the majority of the data Z39.87 covers already exists in file headers. To complete the metadata cycle, harvester utilities have to extract the information from the image file headers and import it into digital-asset-management systems for long-term preservation. By using the image file format specification as an integral part of the Data Dictionary, the standard leverages existing metadata to achieve cost savings.

On the other hand, in going beyond the TIFF specifications for some elements, the NISO standard acknowledges information outside the TIFF scope that plays an important role in digital preservation. From this vantage point, the Data Dictionary becomes an important tool for educating vendors about the metadata our community sees as invaluable to preserve our investment. RLG is investigating the formation of a group advocating among digital camera-back vendors for the cultural heritage community’s metadata needs.[11] An industry standard for consumer digital cameras called DIG35 already has broad support among vendors. DIG35 allows transfer of information from the camera to the software utility that consumers use to manage their holiday snapshots. Building on that model, NISO Z39.87 in its XML instantiation MIX could become the file-exchange format to go between high-end scanners or camera backs and sophisticated asset-management databases.
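The harvesting step, in which a utility reads parameters out of TIFF tags, might look something like the Python sketch below. It relies only on the published TIFF layout (a byte-order mark, the number 42, and an image file directory of 12-byte entries) and the standard tag numbers for image width, image length, compression, and photometric interpretation. The function names are hypothetical, only single-value SHORT and LONG fields in the first directory are handled, and `sample_tiff_header` fabricates a header-only test fragment rather than a displayable image.

```python
import struct

# Standard TIFF tag numbers for a few basic image parameters.
TAG_NAMES = {256: "ImageWidth", 257: "ImageLength",
             259: "Compression", 262: "PhotometricInterpretation"}


def harvest_tiff_tags(data: bytes) -> dict:
    """Read selected single-value tags from the first image file directory."""
    byte_order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if byte_order is None:
        raise ValueError("missing TIFF byte-order mark")
    magic, ifd_offset = struct.unpack(byte_order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a TIFF file")
    (entry_count,) = struct.unpack_from(byte_order + "H", data, ifd_offset)
    harvested = {}
    for index in range(entry_count):
        entry_offset = ifd_offset + 2 + 12 * index
        tag, field_type, count = struct.unpack_from(byte_order + "HHI", data, entry_offset)
        if tag not in TAG_NAMES or count != 1:
            continue
        if field_type == 3:      # SHORT: 16-bit value stored in the entry itself
            (value,) = struct.unpack_from(byte_order + "H", data, entry_offset + 8)
        elif field_type == 4:    # LONG: 32-bit value stored in the entry itself
            (value,) = struct.unpack_from(byte_order + "I", data, entry_offset + 8)
        else:
            continue
        harvested[TAG_NAMES[tag]] = value
    return harvested


def sample_tiff_header() -> bytes:
    """Build a little-endian header and directory for testing (no pixel data)."""
    entries = [(256, 4, 640), (257, 4, 480), (259, 3, 1), (262, 3, 2)]
    ifd = struct.pack("<H", len(entries))
    for tag, field_type, value in entries:
        ifd += struct.pack("<HHI", tag, field_type, 1)
        ifd += struct.pack("<I", value) if field_type == 4 else struct.pack("<HH", value, 0)
    ifd += struct.pack("<I", 0)          # offset of next directory: none
    return b"II" + struct.pack("<HI", 42, 8) + ifd


print(harvest_tiff_tags(sample_tiff_header()))
```

In the TIFF specification a compression value of 1 means uncompressed and a photometric interpretation of 2 means RGB, so even this short directory already answers several of the questions a preservation database would ask.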

The Data Dictionary divides the technical metadata elements into four groups.

  • Basic image parameters record information crucial to displaying a viewable image. With this information alone a programmer should be able to build a viewing application for the image from scratch. Elements represented in this section include format (GIF, JFIF/JPEG, TIFF, etc.), compression, and photometric interpretation (color space).
  • Image creation metadata records information crucial to understanding the technical environment in which a digital image file was captured. Just as in humans, any number of characteristics or issues of an image can be traced back to its birth. Elements represented in this section include SourceType (the analog source of a capture), ScanningSystem (identification of the particular scanning device used), and DateTimeCreated (date of the image’s birth).
  • Imaging performance assessment metadata records information that allows evaluation of the digital image’s quality, or output accuracy. This data aids in uncovering the characteristics of the original source of the image and functions as a benchmark for displaying or printing the file. Elements represented in this section include width and height of the digital image and the source, as well as various parameters for capturing the sampling frequency of an image and its color characteristics. Furthermore, the section hosts information about any color targets (such as GretagMacbeth or Q60) included in a capture.
  • Change history metadata records information about the processes applied to an image over its life cycle. This data tracks any changes to the original file, for example, during the course of preservation activities such as refreshing (copying the file to a new storage format) or migration (saving the file from an eclipsing file format to an emerging file format). Elements represented in this section include DateTimeProcessed, ProcessingAgency, and ProcessingSoftware.

For any institution just starting out on the path of digital preservation, managing technical metadata through the NISO Data Dictionary is a great first step. The term data dictionary itself comes from the database community; it refers to a file defining the basic organization of a database down to its individual fields and field types. NISO Z39.87 represents a blueprint for a database or a database module that can be implemented fairly quickly—all the intellectual legwork has already been done by the standards committee.
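As a sketch of what such a database module might look like, the Python example below creates two tables in an in-memory SQLite database: one for per-image technical metadata and one for change history. The element names come from the groups listed above, but the table layout, the snake_case column names, the column types, and the sample rows are assumptions made for illustration; the Data Dictionary itself defines many more elements and does not prescribe a relational design.

```python
import sqlite3

SCHEMA = """
CREATE TABLE image_technical_metadata (
    image_id            TEXT PRIMARY KEY,
    format              TEXT NOT NULL,   -- basic image parameters
    compression         TEXT,
    photometric_interp  TEXT,
    source_type         TEXT,            -- image creation
    scanning_system     TEXT,
    date_time_created   TEXT,
    image_width         INTEGER,         -- imaging performance assessment
    image_height        INTEGER
);
CREATE TABLE change_history (
    image_id             TEXT REFERENCES image_technical_metadata(image_id),
    date_time_processed  TEXT,
    processing_agency    TEXT,
    processing_software  TEXT
);
"""

connection = sqlite3.connect(":memory:")
connection.executescript(SCHEMA)

# Hypothetical sample rows: one image and one preservation action applied to it.
connection.execute(
    "INSERT INTO image_technical_metadata (image_id, format, image_width, image_height) "
    "VALUES (?, ?, ?, ?)",
    ("leaf-1", "TIFF", 640, 480),
)
connection.execute(
    "INSERT INTO change_history VALUES (?, ?, ?, ?)",
    ("leaf-1", "2003-06-15T00:00:00", "Example Library", "hypothetical-migrator 1.0"),
)

rows = connection.execute(
    "SELECT m.image_id, m.format, h.processing_software "
    "FROM image_technical_metadata AS m "
    "JOIN change_history AS h ON h.image_id = m.image_id"
).fetchall()
print(rows)
```

Keeping change history in its own table lets a single image accumulate any number of refreshing or migration events over its life cycle.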

To expand the database to include structural metadata relating files to each other, a descriptive record, and rights metadata, an institution could look to METS and its extension schemas. Again, the Library of Congress AV Prototyping project offers a model implementation of a database using the METS approach. Scaling up to the bigger picture, this database could find its home in an archival environment specified by the OAIS.

Fig. 4. OAIS, METS, and NISO Z39.87 as nesting standards

To summarize: as illustrated by figure 4, the OAIS stipulates information packages, which find instantiation in METS; METS stipulates an extension schema for technical metadata, which finds an instantiation in Z39.87’s XML schema, MIX. Now, after the detailed review, the first bulleted list in this article should make a lot more sense.

In broad strokes, digital preservation with the nesting standards OAIS, METS, and Z39.87 looks like a puzzle with all the pieces neatly falling into place. In the details, however, some harmonization issues between the standards remain. For example, the OAIS model breaks an information package into different subcomponents than the METS schema; the NISO Data Dictionary and its XML encoding MIX cover not only the technical metadata extension of METS, but also some elements that the digital object standard relegates to sections on source and digital provenance. Nevertheless, the convergence of three standards developed independently illustrates that a holistic view of digital preservation is emerging. Only widespread implementation will tell whether the theory as outlined by the standards can hold up in practice.

Footnotes
[1] Both OAIS and METS are referenced by multiple essays.

[2] National Information Standards Organization.

[3] For a review of CEDARS, NEDLIB, and NLA implementations of the OAIS, see “Preservation Metadata for Digital Objects: A Review of the State of the Art,” published by the OCLC/RLG Working Group on Preservation Metadata.

[4] See http://wwwclassic.ccsds.org/documents/pdf/CCSDS-650.0-B-1.pdf, p.1-1.

[5] RLG also maintains an OAIS Web site with further links.

[6] For an introduction to the OAIS that goes beyond the present article, consult “Meeting the Challenges of Digital Preservation: The OAIS Reference Model,” by Brian Lavoie.

[7] Some of you may be more familiar with document type definitions (DTDs) to specify rules for valid SGML and XML markup.

[8] One of the key differences between XML schemas and DTDs is that only schemas allow extensions through the use of XML namespaces.

[9] For more information on XML in the cultural heritage community, see “DigiCult Technology Watch Briefing 7: The XML Family of Technologies.”

[10] For the full TIFF tag library, see Appendix A of the format specifications.

[11] For more information about this fledgling initiative, please contact the author.


Saving Digital Heritage—A UNESCO Campaign

Colin Webb
National Library of Australia

Considering that the disappearance of heritage in whatever form constitutes an impoverishment of the heritage of all nations …
Recognising that … resources of information and creative expression are increasingly produced, distributed, accessed and maintained in digital form, creating a new legacy—the digital heritage …
Understanding that this digital heritage is at risk of being lost and that its preservation for the benefit of present and future generations is an urgent issue of worldwide concern …

So begins an important new document being prepared for submission to the General Conference of UNESCO, the United Nations Educational, Scientific and Cultural Organisation. The Draft Charter on the Preservation of the Digital Heritage was positively received by a recent session of the UNESCO Executive Board, which asked for further consultations during preparation of a final draft for consideration. The Draft Charter is one very visible element in an international campaign to address the barriers to digital continuity and to head off the emergence of a second “digital divide,” in which the tools of digital preservation are restricted to the heritage of a well-resourced few.

As well as the Charter, other elements of UNESCO’s strategy for promoting digital preservation include widespread consultations, the development of practical and technical guidelines, and a range of pilot projects. UNESCO has been critical in fostering the understanding and preservation of other kinds of heritage through avenues such as the World Heritage Convention and the Memory of the World program. Given the organisation’s commitment to the safeguarding of recorded knowledge evident in its Information for All program, it is not surprising that UNESCO has been concerned at the prospect of the loss of vast amounts of digital information.

Digital technology’s immense potential for human benefit in so many areas—communication, expression, knowledge sharing, education, community building, accountability, to name just a few—is a tantalizing promise so easily denied by the lack of means, knowledge, or will to deal with its other great potential: rapid loss of access.

The impetus for this campaign was embedded in a resolution passed by the UNESCO General Conference at its previous meeting in 2000. That resolution, drafted in part by the Council of Directors of National Libraries (CDNL), highlighted the need to safeguard endangered digital memory. Following that, as a basis for developing a UNESCO strategy, the European Commission on Preservation and Access (ECPA) was commissioned to prepare a discussion paper outlining the issues in digital preservation for debate.

Consultation Process

In addition to circulating its draft papers for comment to governments, nongovernment organisations, and experts all over the world, the campaign has featured a number of regional consultation meetings convened specifically to raise issues of regional concern and to provide comment on the Preliminary Draft Charter and Draft Guidelines on the Preservation of Digital Heritage. The meetings were held between November 2002 and March 2003, in Canberra, Australia (for Asia and the Pacific); in Managua, Nicaragua (for Latin America and the Caribbean); in Addis Ababa, Ethiopia (for Africa); in Riga, Latvia (for the Baltic states); and in Budapest, Hungary (for Eastern Europe).

All the meetings confirmed the need for urgent action and the great distance to be traveled before preservation of digital heritage is a reality in most countries. In total, around 175 experts and stakeholders from eighty-six countries participated in the five meetings, representing libraries, records archives, museums, audiovisual archives, data archives, producers and publishers of digital content, lawyers, universities and academies, governments, standardization agencies, community development organisations, computer industries, and researchers, among others.

Draft Charter on the Preservation of the Digital Heritage

Charters and declarations promulgated by UNESCO are meant to be “normative” documents that member states agree to through a vote of acceptance rather than by individual ratification. They are not binding and do not require any specific action on the part of governments, but they do express aspirations and priorities. In this case the purpose of the Draft Charter is to focus worldwide attention on the issues at stake and to encourage responsible preservation action wherever it can be taken.

The Draft Charter explains that the digital heritage

consists of unique resources of human knowledge and expression, whether cultural, educational, scientific or administrative, while embracing technical, legal, medical and other kinds of information that more and more are being created digitally, or converted into digital form from existing analogue resources.… Many of these resources have lasting value and significance, and therefore constitute a heritage that should be protected and preserved for current and future generations. This heritage may exist in any language, in any part of the world, and in any area of human knowledge or expression.

The purpose of preserving this heritage is to ensure that it can be accessed. The Draft Charter recognizes that this involves a tension and seeks a “fair balance between the legitimate rights of creators and other rights holders and the interests of the public to access digital heritage materials” in line with existing international agreements. It recognizes that some digital information is sensitive or of a personal nature and that some restrictions on access and on opportunities to tamper with information are necessary. Sensibly, it asserts the responsibility of each member state to work with “relevant organisations and institutions in encouraging a legal and practical environment which would maximise accessibility of the digital heritage.”

Threats to this digital heritage are highlighted, including rapid obsolescence of the technologies for access, an absence of legislation that fosters preservation, and international uncertainties about resources, responsibilities, and methods. Urgent action is called for, ranging from awareness raising and advocacy to practical programs that address preservation threats throughout the digital life cycle.

In discussing the measures that are needed, the Draft Charter emphasizes the importance of deciding what should be kept, taking account of the significance and enduring value of materials, and noting that the digital heritage of all regions, countries, and communities should be preserved and made accessible. It discusses the legislative and policy frameworks that will be needed and calls on member states to designate agencies with coordinating responsibility. It also calls on governments to provide adequate resources for the task.

Many agencies have a role to play, both within and outside governments. Agencies are urged to work together to pursue the best possible results and to democratize access to digital preservation methods and tools. The Draft Charter proposes a UNESCO commitment to foster cooperation, build capacity, and establish standards and practices that will help. Although this document is meant to inspire rather than dictate action, its adoption by UNESCO will be an important opportunity to raise digital preservation issues with governments and others who can influence how laws, budgets, and expectations are framed to help or hinder continuity of the digital heritage.


Guidelines for the Preservation of Digital Heritage

While the Charter focuses on advocacy and public policy issues, the Guidelines present practical principles on which technical decisions can be based throughout the life cycle of a wide range of digital materials. The Guidelines, prepared by the National Library of Australia on commission from the UNESCO Division of Information Society, have been published on the UNESCO CI (Communication and Information) Web site.

The guidelines address at least four kinds of readers with different but overlapping needs:

  • policy makers looking for information on which to base policy commitments regarding digital preservation
  • high-level managers who are seeking to understand the concepts of digital preservation and the key management issues their programs will face
  • line managers involved in making day-to-day decisions who need a more-detailed understanding of practical issues
  • operational practitioners responsible for implementing programs who need a perspective on how various practical issues and processes fit together as an integrated whole.

The structure of the guidelines is intended to make it easy for readers to find the information most relevant to their needs. The regional consultation process highlighted the fact that many people who feel they have a preservation responsibility are operating with very limited resources. Specific suggestions have been included to provide some starting points, although comprehensive, reliable digital preservation is a resource-intensive business.

Material in the Guidelines is organized around two approaches: basic concepts behind digital preservation (explaining concepts of digital heritage, digital preservation, preservation programs, responsibility, management, and cooperation) and more-detailed discussion of processes and decisions involved in various stages of the digital life cycle, including deciding what to keep, working with producers, taking control and documenting digital objects, managing rights, protecting data, and maintaining accessibility.

Although the Guidelines were produced directly by the National Library of Australia, they were extensively informed by reading and by comments from a wide range of contacts, as well as by the responses from the formal consultation meetings. The text does not reflect any new research, but does try to reflect current thinking about the maintenance of accessibility, the core issue in digital preservation (although certainly not the only important issue).

For some readers the level of technical detail will be disappointing. The detail required to meet all the needs of practitioners is very situation-specific and quickly dated. As the Guidelines are intended to be useful in a very wide range of sectors and circumstances, the emphasis is on technical and practical principles that should enable practical decisions. It is to be hoped that UNESCO will complement the Guidelines with a Web site offering a growing body of technical details and tips aimed at specific sectors.

To give readers a sense of the approaches taken, a few of the principles asserted in the Guidelines are appended to this paper. The UNESCO Guidelines for the Preservation of Digital Heritage will be published in a number of languages. At the time of writing, they are available in English from the UNESCO Web site.

Sample Principles from the UNESCO Guidelines for the Preservation of Digital Heritage

1. Not all digital materials need to be kept, only those that are judged to have ongoing value: these form the digital heritage.

3. Digital materials cannot be said to be preserved if access is lost. The purpose of preservation is to maintain the ability to present the essential elements of authentic digital materials.

4. Digital preservation must address threats to all layers of the digital object: physical, logical, conceptual, and essential.

5. Digital preservation will happen only if organisations and individuals accept responsibility for it. The starting point for action is a decision about responsibility.

6. Everyone does not have to do everything; everything does not have to be done all at once.

7. Comprehensive and reliable preservation programs are highly desirable, but they may not be achievable in all circumstances of need. Where necessary, it is usually better for noncomprehensive and nonreliable action to be taken than no action at all. Small steps are usually better than no steps.

8. In taking action, managers should recognize that there are complex issues involved. It is important to do no harm. Managers should seek to understand the whole process and the objectives that eventually need to be achieved and avoid steps that will jeopardize later preservation action.

15. Preservation programs must clarify their legal right to collect, copy, name, modify, preserve, and provide access to the digital materials for which they take responsibility.

24. Authenticity is best protected by measures that ensure the integrity of data is not compromised and by documentation that maintains the clear identity of the material.

26. The goal of maintaining accessibility is to find cost-effective ways of guaranteeing access whenever it is needed, in both the short- and long-term.

27. Standards are an important foundation for digital preservation, but many programs must find ways to preserve access to poorly standardised materials, in an environment of changing standards.

28. Preservation action should not be delayed until a single ‘digital preservation standard’ appears.

29. Digital data is always dependent on some combination of software and hardware tools for access, but the degree of dependence on specific tools determines the range of preservation options.

30. It is reasonable for programs to choose multiple strategies for preserving access, especially to diverse collections. They should consider the potential benefits of maintaining the original data streams of materials as well as any modified versions, as insurance against the failure of still-uncertain strategies.

32. Preservation programs are often required to judge acceptable and unacceptable levels of loss in terms of items, elements, and user needs.

33. Waiting for comprehensive, reliable solutions to appear before taking responsible action will probably mean material is lost.

34. Preservation programs require good management that consists largely of generic management skills combined with enough knowledge of digital preservation issues to make good decisions at the right time.

35. Digital preservation incorporates the assessment and management of risks.

39. While suitable service providers may be found to carry out some functions, ultimately responsibility for achieving preservation objectives rests with preservation programs and with those who oversee and resource them.

 


Highlighted Web Site

Digital Dog

Digital Dog is a training, consulting, and service business dedicated to digital imaging, electronic photography, and color management. The website provides a variety of free digital imaging tutorials, including a color management primer, scanner interface review, tips on calibrating digital cameras, and an "in the trenches" guide to image resolution. Many of the articles available on the Digital Dog site were written for Photo Electronic Imaging magazine, and contain reliable technical content presented in an accessible, down-to-earth style.

This site should be a valuable source of information for institutions that are involved in scanning projects or are looking for good digital imaging training materials. Some of the older articles are out-of-date, but there is a great deal of practical content available. Most of the tutorials are PDF documents, and will require the Acrobat Reader plug-in.

[Errata added 17 June 2003:
Dear Reader: Our choice of this site has proven to be an unintended object lesson in the volatility of Web resources. Within two days of our final URL check, the Digital Dog site was completely reorganized, its domain name changed, and most of its tutorial content disappeared. Though RLG DigiNews often covers efforts to preserve Web sites, we too are sometimes caught off guard by the swiftness and suddenness of their transformations.]




FAQ

Squeezing More Life Out of Bitonal Files: A Study of Black and White. Part III.

Your editor's interview in the December 2002 RLG DigiNews states that JPEG 2000 can save space and replace the multitude of file formats used for conversion and display of cultural heritage images but that it isn't suitable for bitonal material. We have lots of bitonal images. Is there anything similar available for them?

Part I of this three-part FAQ discussed general considerations for migration of scanned bitonal images away from TIFF G4, while Part II examined the characteristics of several alternative bitonal file formats and compression schemes that have become available during the past decade. In this, the final installment, we present the results of our experiences with several products for converting individual and multipage bitonal high-resolution TIFF G4s. Our coverage includes product specifications, general impressions, compression data, and sample images. Please note that some of the files require special plug-ins to be viewed. Instructions for downloading the necessary viewers are given below.

Test Image Selection

Though a bitonal image may seem like a simple affair, how well a particular image compresses depends on how it was scanned, the nature of its content, and the design of the compression scheme. Characteristics of the source image that can affect the rate of compression include:

  • the "cleanliness" of the scan (extraneous speckles lower compression)
  • the resolution of the scan (lower resolution lowers compression)
  • the use of multiple sizes and styles of text (more variation lowers compression)
  • the density of information present (less white space lowers compression)
  • the presence of fine detail, e.g., engravings or halftones (high complexity lowers compression)
  • the number of pages (fewer pages lowers compression)

Why do these factors affect compression? It helps to understand a little about how image compression is accomplished. Lossless compression depends on the recognition of patterns and the replacement of repeated elements with compact representations that exactly describe the feature being compressed. For example, instead of storing every bit in a scan line of all white bits, simply store a count of the white bits. Thus, sparse printing that leaves a lot of white space compresses well, while dense printing or highly speckled pages result in more transitions between black and white and thus less efficient compression.
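
The white-run example can be sketched in a few lines of Python. This is only an illustration of the run-length idea, not the CCITT Group 4 coding actually used in TIFF G4 files; the function names and the sample scan lines are invented for the example.

```python
from itertools import groupby


def run_lengths(scan_line):
    """Encode a row of bits (0 = white, 1 = black) as alternating run lengths.

    By convention the first run is white, so a line that starts with black
    gets a leading white run of length 0.
    """
    runs = [len(list(group)) for _, group in groupby(scan_line)]
    if scan_line and scan_line[0] == 1:
        runs.insert(0, 0)
    return runs


def decode(runs):
    """Rebuild the original row of bits from its run lengths (lossless)."""
    line = []
    bit = 0  # runs alternate white, black, white, ...
    for length in runs:
        line.extend([bit] * length)
        bit ^= 1
    return line


blank = [0] * 3120                               # an all-white scan line
sparse = [0] * 1000 + [1] * 20 + [0] * 2100      # one short black stroke
speckled = [0, 1] * 1560                         # worst case: every pixel flips

blank_runs = run_lengths(blank)
sparse_runs = run_lengths(sparse)
speckled_runs = run_lengths(speckled)
```

The blank line collapses to a single count and the sparse line to three, while the speckled line needs as many counts as it has pixels, which is why noisy or dense pages compress poorly.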

The more sophisticated compression algorithms tested here take advantage of the fact that higher level elements are repeated within printed documents, including the symbols that make up the text. Thus, if a 12-point, Times Roman, non-bold, non-italic, non-underlined, lowercase 'a' appears in a document, its bitmap can be stored in a database and a subsequent appearance of the identical character can be replaced by a pointer to the database. This explains why clean, uniform typography compresses better than irregular, highly variant typography. Longer documents have an advantage because the algorithm "learns" more and more of the characters as it processes the text.
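
A minimal sketch of this symbol-dictionary idea follows. It assumes the page has already been segmented into glyph bitmaps and that only exact matches are replaced; the function names and the tiny sample glyphs are hypothetical, and real JB2 and JBIG2 encoders are considerably more elaborate.

```python
def encode_symbols(glyphs):
    """Replace each glyph bitmap with a pointer into a dictionary of unique bitmaps.

    Each glyph is a tuple of strings, one string per pixel row.
    Returns (dictionary, pointers).
    """
    dictionary = []   # unique bitmaps, in order of first appearance
    index_of = {}     # bitmap -> its position in the dictionary
    pointers = []
    for bitmap in glyphs:
        if bitmap not in index_of:
            # First time this exact bitmap is seen: store it once.
            index_of[bitmap] = len(dictionary)
            dictionary.append(bitmap)
        pointers.append(index_of[bitmap])
    return dictionary, pointers


def decode_symbols(dictionary, pointers):
    """Recover the original glyph sequence from the dictionary and pointers."""
    return [dictionary[i] for i in pointers]


LETTER_A = (".##.", "#..#", "####", "#..#")
LETTER_B = ("###.", "#.##", "#..#", "###.")

page = [LETTER_A, LETTER_B, LETTER_A, LETTER_A, LETTER_B]
dictionary, pointers = encode_symbols(page)
```

Five glyphs are stored as two bitmaps plus five small integers. The longer the document, the more often a glyph is already in the dictionary, which is the "learning" effect described above.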

Halftones deserve a special mention. Bitonal halftoning is a printing process that simulates shades of gray by varying the size and spacing of black dots. Avoiding problems such as moiré (interference patterns) and poor contrast when scanning halftones bitonally requires the use of special processing algorithms (e.g. dithering or error diffusion). When done properly, the typical scanned halftone will be densely packed with data of a somewhat random nature, presenting a real challenge to compression algorithms.

Lossy but "visually lossless" compression attempts to remove elements that are redundant for human visual perception, producing an image that contains less information, but doesn't appear degraded.

We selected four images for in-depth testing, representing a variety of content types. We also tested 20-page sequences derived from the same works in order to average out anomalies, and give the compression algorithms a chance to show off their "learning curves."

The images are from three of Cornell's older collections: historic math books, NEH agriculture, and historic monographs. All images are bitonal 600 dpi TIFF G4s. If you follow the links for the individual pages from Table 1, you'll be taken to the image as it appears within the Cornell Digital Library—converted from TIFF to GIF, scaled down by a factor of six and enhanced with gray for improved legibility. The links for the 20-page groupings will bring up all 20 pages in GIF thumbnail mode, from which larger GIFs of the individual pages can then be accessed.

Table 1. Details of Test Images

| Title/Author/Publication date | Important characteristics | Individual page number and size (width x height) | 20-page grouping | Collection source |
| --- | --- | --- | --- | --- |
| An elementary treatise on elliptic functions/Arthur Cayley/1895 | Variable-sized text, heavy use of math symbols, clean scans with a fair amount of white space | p. 16; 3120x5056 pixels (5.2" x 8.43") | pp. 1-20; images 19-38 | Historic math books |
| The Modern Farmer in His Business Relations/Edward F. Adams/1899 | Very uniform text, fairly dense, fairly clean | p. 33; 3424x5184 pixels (5.71" x 8.64") | pp. 22-41; images 26-45 | NEH agriculture collection |
| The Mushroom, Edible and Otherwise: Its Habitat and Its Time of Growth/M. E. Hard/1908 | About 45% text and 55% halftone illustrations, heavily speckled scans; the individual page consists almost entirely of a halftone | p. lxvii (67); 4080x6000 pixels (6.8" x 10") | pp. lxvii-lxxxvi (67-86); images 69-88 | NEH agriculture collection |
| The steam turbine, the Rede lecture, 1911/Charles A. Parsons/1911 | About 50% text, 35% complex line art and 15% halftone illustrations; the individual page consists almost entirely of a complex line drawing | p. 27; 2832x4368 pixels (4.72" x 7.28") | pp. 13-32; images 23-42 | Historic monograph collection |
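
The physical page sizes in Table 1 follow directly from the pixel dimensions and the 600 dpi scanning resolution. The short sketch below reproduces that arithmetic and also estimates the uncompressed size of a bitonal page, assuming one bit per pixel with each row padded to a whole byte. The function names are invented for the example, and the uncompressed figure is an illustration, not a measurement from the tests.

```python
def physical_size_inches(width_px, height_px, dpi=600):
    """Convert pixel dimensions to inches at the given resolution, rounded to 2 places."""
    return round(width_px / dpi, 2), round(height_px / dpi, 2)


def uncompressed_bytes(width_px, height_px):
    """Size of a 1-bit-per-pixel bitmap, assuming each row is padded to a whole byte."""
    bytes_per_row = (width_px + 7) // 8
    return bytes_per_row * height_px


# Pixel dimensions of the four individual test pages, as listed in Table 1.
test_pages = {
    "Cayley": (3120, 5056),
    "Adams": (3424, 5184),
    "Hard": (4080, 6000),
    "Parsons": (2832, 4368),
}

sizes = {name: physical_size_inches(w, h) for name, (w, h) in test_pages.items()}
cayley_raw = uncompressed_bytes(*test_pages["Cayley"])
```

Under these assumptions the Cayley page works out to 5.2 by 8.43 inches and a little under 2 MB uncompressed, which gives a baseline against which any compressed file size can be compared.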

 

Conversion software selection

Our software testing was limited to products that are open source or for which free evaluation copies are available. In some areas of computing, that might greatly constrain the selections, but in the specialized market niche of bitonal image conversion, it hardly cramped our style at all. We were able to test most of the important packages without spending a dime on software acquisition, which bodes well for anyone who wants to test these products on their own image collections.

As indicated in part II, we focused testing on products supporting three main technologies:

CPC (Cartesian Perceptual Compression): a file format and lossy compression scheme for bitonal images.

We fully tested CPC Tool from Cartesian Products, the only product available to encode this proprietary format.

DjVu: a file format supporting several compression schemes for bitonal, gray level, and color images. It supports both lossy and lossless bitonal compression. The bitonal compression algorithm is called JB2 and is similar to JBIG2.

We fully tested Any2DjVu, a Web service that allows files of many different formats (including TIFF G4) to be uploaded and converted to the DjVu format.

We also tested cjb2, a bitonal DjVu encoder that is part of the DjVuLibre package, an open source implementation of DjVu. Cjb2 only converts single pages. Although DjVuLibre comes with a utility (called djvm) that combines single DjVu pages into multipage DjVu files, it does not support font learning across pages. Thus we tested cjb2 only for the encoding of single pages.

There is also a commercial DjVu encoder, made by the format's owner, LizardTech, Inc., which we did not test. It is currently part of LizardTech's Document Express 4.0, and a trial version is available from LizardTech's Web site. The trial became available fairly late in our testing cycle and requires a special page cartridge (allowing the encoding of 250 pages), which we requested but had still not received ten days later.

JBIG2: a lossless and lossy compression scheme for bitonal images only. JBIG2 does not specify a file format, but is often associated with PDF.

We fully tested two JBIG2 in PDF encoders, PdfCompressor from CVision Technologies and SILX from PARC (Palo Alto Research Center).

Another option for JBIG2 that we did not test is Adobe's Acrobat Capture with Compression PDF Agent.

Tables 2 and 3 provide additional details on the products tested.

Table 2. Product information (general)

| Product and version | Producer | Type of product | Demo available/terms | Platforms supported | Viewer software (all freely available for downloading) |
| --- | --- | --- | --- | --- | --- |
| CPC Tool 5.1.x | Cartesian Products, Inc. | Commercial | Yes/1000 file limitation when converting to or from CPC | Windows (95 and up), MacOS X, Linux, various Unixes | CPC Lite (for Windows); CoPyCat (for Mac, Linux and several Unixes; also requires Acrobat Reader) |
| Any2DjVu | DjVu Zone | Free Web service | Yes/Response time will depend on size of file and how busy the service is; not meant for production use | Any platform that supports a graphical Web browser | LizardTech DjVu browser plug-in (Windows, Mac Classic, Mac OS X, and Unix); DjVuLibre browser plug-in (Linux and Unix; part of the DjVuLibre distribution) |
| cjb2 from the DjVuLibre package 3.5.x | | Open source | Yes/freeware | Windows 95 and up (also available as part of the full DjVuLibre package); Linux and Unix versions also available | Same as above |
| CVista PdfCompressor 2.1 | CVision Technologies | Commercial | Yes/30 days or 1000 files (whichever comes first) and output has watermark and footer | Windows (95 and up) | Adobe Reader or PDFViewer browser plug-in (most platforms; must be at least version 5) |
| Silx 3.1 (previously called DigiPaper) | PARC Solutions (Xerox PARC) | Commercial | Yes/90 days and output files have watermarks | Windows (95 and up), Linux and Sun Solaris | Same as above |

 

Table 3. Product features

Testing Protocols

How we tested

For each tool, we converted the four individual test pages from TIFF G4 to the supported target format. In the case of cjb2, the open source bitonal DjVu encoder, we first had to convert the TIFF G4s to pbm (portable bitmap) format, which we did with the free Windows application Irfanview. We also converted 20-page groupings derived from the same works as the individual test pages, except for cjb2, which only handles single pages.
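
For readers unfamiliar with pbm: the binary ("P4") variant is a short text header followed by the pixels packed eight to a byte, most significant bit first, with 1 meaning black and each row padded to a whole byte. The sketch below writes such data from rows of bits. It does not decode TIFF G4, the function name is hypothetical, and it is not a description of how Irfanview performs the conversion.

```python
def to_pbm_p4(rows):
    """Serialize rows of bits (1 = black, 0 = white) as binary PBM (P4) data."""
    height = len(rows)
    width = len(rows[0]) if rows else 0
    out = bytearray(f"P4\n{width} {height}\n".encode("ascii"))
    for row in rows:
        if len(row) != width:
            raise ValueError("all rows must have the same width")
        for start in range(0, width, 8):
            chunk = row[start:start + 8]
            byte = 0
            for bit in chunk:
                byte = (byte << 1) | bit      # most significant bit first
            byte <<= 8 - len(chunk)           # zero-pad the last byte of the row
            out.append(byte)
    return bytes(out)


# A 10-pixel-wide, 2-row sample: an alternating row, then an all-white row.
sample = [
    [1, 0, 1, 0, 1, 0, 1, 0, 1, 1],
    [0] * 10,
]
pbm_data = to_pbm_p4(sample)
```

Writing `pbm_data` to a file opened in binary mode would produce a small image that pbm-aware tools should be able to read.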

Other than PdfCompressor, which runs only under Windows (CVision says the product will eventually support Solaris), all the tested products can be run under Windows, Linux or Unix. We ran the Windows version of cjb2, and the Solaris versions of CPC Tool and Silx. However, results should be the same regardless of the platform on which the conversions are carried out.

Each product offers options that affect the speed of conversion, display speed,