June 15, 2003, Volume 7, Number 3
ISSN 1093-5371
Editor's Interview
National Digital Information Infrastructure and Preservation Program
Laura Campbell
Editors’ Note
It’s wonderful that Congress has authorized and financially supported the next phase of NDIIPP, but how will the funds be committed? What percentage of funding will be spent on research, planning, implementation, evaluation, and other core areas?
Like Russian Dolls: Nesting Standards for Digital Preservation
This article introduces three standards for digital preservation, at least two of which feature prominently in the appendix of the plan Congress just approved.[1] Understanding what these standards are, and what they can and cannot do, provides a solid foothold in present and future discussions surrounding long-term retention of digital materials, as well as a leg up on implementation.
Although all this probably sounds confusing in bulleted shorthand, it actually makes a lot of sense when properly laid out. This article walks through the standards one by one and elaborates on their functionality and interaction. As it works its way through the standards from the most general to the very specific, it will also home in on digital images as the files to be preserved. The expansive OAIS applies to any type of media, even nondigital materials, whereas METS applies exclusively to the digital realm of images, audio, and video. The NISO Data Dictionary focuses on technical metadata for digital still images.
From a business perspective, digital preservation is a mechanism to ensure return on investment. Enormous amounts of money have been and are being spent on reformatting original materials or creating digital resources natively. If the cultural heritage community cannot sustain access to those resources or preserve them, the investment will not bear the envisioned returns. Although a basic understanding of the general problems surrounding preservation in an ever-changing technical environment has started to permeate memory institutions, practical solutions to the challenge are slow to emerge. The three standards, OAIS, METS, and Z39.87, converge as a sustainable system architecture for digital image preservation.
The space data community represents another group with enormous stakes in the long-term viability of its data. Capturing digital imagery of art or manuscripts may seem expensive, but the cost pales in comparison to that of gathering digital imagery from outer space. Under those circumstances, losing access to data is not an option. To foster a framework for preserving data gathered in space, the Consultative Committee for Space Data Systems (CCSDS) began work on an international standard in 1990. A good ten years later the OAIS was approved by the International Organization for Standardization (ISO).
In the standard’s own words, “[a]n OAIS is an archive, consisting of an organization of people and systems, that has accepted the responsibility to preserve information and make it available for a Designated Community.”[4] The standard formulates a framework for understanding and applying concepts in long-term preservation of digital information. It provides a common language for talking about these concepts and the organizational principles surrounding an archive. Though the OAIS pertains to both the digital and the analog realm, it has received the most attention for its applicability to digital data. As a reference model, the OAIS in and of itself does not specify an implementation—it does not tell you which computers to buy, which software to load, or which storage medium to use. The standard does tell you, however, how an archive should be organized. In its so-called functional model, it defines the entities (or departments, if you will) in an archive, their responsibilities, and interactions. The data flows between those entities and the outside world are specified in the information model, which delineates how information gets into the archive, how it lives in the archive, and how it gets served to the public. The OAIS leaves it up to every distinct community to flesh out an implementation of the high-level guidelines. For the cultural heritage community a number of OAIS-related documents exploring the framework’s application to libraries, museums, and archives have come out of the joint OCLC-RLG Preservation Metadata Working Group.[5]
As figure 1 illustrates, the OAIS stipulates that an archive (everything within the square box) interacts with a producer as well as a consumer. It takes in data from the producer through its ingest entity, and it serves out data to the consumer through its access entity. Within the archive itself, the data content submitted for preservation gets stored and maintained in the archival storage unit; data management maintains the descriptive metadata identifying the archive’s holdings. The OAIS dubs the data flowing between the different players information packages, or IPs. The data flows sketched out in figure 1 contain the following information packages:
The data represented by the information packages may vary according to the specific needs at each station: an archival information package, for example, probably contains more data aimed at managing the object than its more light-weight counterpart on the access side, the dissemination information package. Furthermore, the OAIS details several categories of information comprising a complete information package, but in keeping with its role as a reference model, it stops short of suggesting specific data elements or a specific encoding for the entire bundle of information.[6] Any community interested in implementing the OAIS has to identify or create a file-exchange format to function as an information package. For the cultural heritage community, METS shows great potential for filling that slot.
METS wraps digital surrogates with descriptive and administrative metadata into one XML document. Digital surrogates in this context could be digital image files as well as digital audio or video. At the heart of each METS object sits the structural map, which becomes a table of contents for public access. The hierarchy of the structural map allows the navigation of media files embedded in, or referenced by, the METS object. It enables browsing through the individual pages of an artist’s book as well as jumping to specific segments in a time-based program, for example, a particular section of a video clip. These so-called digital objects encoded in METS have three main applications that conveniently align with their potential as OAIS information packages.
Fig. 2. A METS object represented in the context of RLG Cultural Materials: a Chinese album from the Chinese Paintings Collection, contributed by the UC Berkeley Art Museum and Pacific Film Archive.
The METS XML schema divides the standard into a core component and several extension components. The METS core supports navigation and browsing of a digital object. It consists of a header, content files, and a structural map. The METS extension components support discovery and management of the digital object. They consist of descriptive metadata and administrative metadata, which in turn split into technical, source, digital provenance, and rights metadata.
Figure 3 details the components of a METS object and one possible set of relationships among them.
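To make the core components concrete, the following is a minimal sketch, not a complete or validated METS document. It uses Python's standard library to assemble a header, a file section, and a structural map for a paged object. The element and attribute names (metsHdr, fileSec, fileGrp, file, FLocat, structMap, div, fptr) come from the METS schema; the function name build_album, the FILE0001-style identifiers, and the example URLs are assumptions made purely for illustration.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def m(tag):
    """Qualify a tag name with the METS namespace."""
    return f"{{{METS_NS}}}{tag}"


def build_album(label, page_urls):
    """Build the METS core (header, content files, structural map) for a paged album.

    The function name and the ID pattern are hypothetical; only the element
    and attribute names are taken from the METS schema.
    """
    root = ET.Element(m("mets"), {"LABEL": label})
    ET.SubElement(root, m("metsHdr"))

    # Content files: one <file> per page image, referenced by URL rather than embedded.
    file_sec = ET.SubElement(root, m("fileSec"))
    file_grp = ET.SubElement(file_sec, m("fileGrp"))

    # Structural map: the "table of contents" that puts the pages in order.
    struct_map = ET.SubElement(root, m("structMap"), {"TYPE": "physical"})
    album = ET.SubElement(struct_map, m("div"), {"TYPE": "album", "LABEL": label})

    for number, url in enumerate(page_urls, start=1):
        file_id = f"FILE{number:04d}"
        file_el = ET.SubElement(file_grp, m("file"), {"ID": file_id})
        ET.SubElement(
            file_el, m("FLocat"), {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": url}
        )
        page = ET.SubElement(album, m("div"), {"TYPE": "page", "ORDER": str(number)})
        # Each page division points back to its image through the file's ID.
        ET.SubElement(page, m("fptr"), {"FILEID": file_id})
    return root


if __name__ == "__main__":
    album = build_album(
        "Chinese album",
        ["http://example.org/album/page1.tif", "http://example.org/album/page2.tif"],
    )
    print(ET.tostring(album, encoding="unicode"))
```

A viewer that walks the div elements in ORDER sequence and resolves each fptr to its file location has everything it needs to page through the object, which is the navigation role the core is meant to play.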
The METS designers leveraged the combined power of the W3C specifications for XML schema and Namespaces in XML to create a flexible standard.
In this way, each community can plug in its own preferred descriptive elements as long as they have been formalized into a schema.[9] The visual resources community, for example, may choose to extend METS using the VRA Core, while libraries might be more inclined to stick with Metadata Object Description Schema (MODS) from the Library of Congress. Others may decide the Dublin Core (DC) satisfies their access needs. The flexibility achieved through namespaces gives METS the potential for implementation across a wide range of communities.
The same logic applies to all components of administrative metadata. Each community has the opportunity to specify what data it deems most important for the management of its information, formalize those requirements into an XML schema, and use that schema as an extension to the hub-standard METS. For an example of a project that has identified or created a comprehensive suite of METS extensions, consult the Library of Congress AV Prototyping project.
An alternative to embedding metadata for the extension components through XML Namespaces and external schemas consists in simply referencing the data from within the object. Descriptive or administrative metadata may live outside the XML markup in a database, to which the METS object can point. Even down to the level of media files, METS provides the dual option of referencing or embedding. The METS specification makes provisions for wrapping the actual bit stream of a digital file in the XML. In most cases, however, files live at online locations pointed to from within the object.
In the realm of technical metadata, a fledgling NISO standard takes center stage for describing the different parameters of digital image files. Known as the NISO Data Dictionary - Technical Metadata for Digital Still Images, or Z39.87, the standard specifies a list of metadata elements.
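Returning to the choice between embedding and referencing descriptive metadata, the sketch below shows both options side by side under stated assumptions. The METS elements dmdSec, mdWrap, xmlData, and mdRef and the Dublin Core namespace are real; the helper names embedded_dmd and referenced_dmd, the section IDs, and the record URL are hypothetical, and a production record would need far more than a title and a creator.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
DC_NS = "http://purl.org/dc/elements/1.1/"
for prefix, uri in (("mets", METS_NS), ("xlink", XLINK_NS), ("dc", DC_NS)):
    ET.register_namespace(prefix, uri)


def q(namespace, tag):
    """Return a namespace-qualified tag in ElementTree's {uri}tag notation."""
    return f"{{{namespace}}}{tag}"


def embedded_dmd(section_id, title, creator):
    """Option 1: wrap Dublin Core elements directly inside the METS object."""
    dmd = ET.Element(q(METS_NS, "dmdSec"), {"ID": section_id})
    wrap = ET.SubElement(dmd, q(METS_NS, "mdWrap"), {"MDTYPE": "DC"})
    xml_data = ET.SubElement(wrap, q(METS_NS, "xmlData"))
    # The dc: namespace is what lets a foreign schema live inside METS.
    ET.SubElement(xml_data, q(DC_NS, "title")).text = title
    ET.SubElement(xml_data, q(DC_NS, "creator")).text = creator
    return dmd


def referenced_dmd(section_id, record_url):
    """Option 2: point to a record that lives outside the METS object."""
    dmd = ET.Element(q(METS_NS, "dmdSec"), {"ID": section_id})
    ET.SubElement(
        dmd,
        q(METS_NS, "mdRef"),
        {"LOCTYPE": "URL", "MDTYPE": "OTHER", q(XLINK_NS, "href"): record_url},
    )
    return dmd


if __name__ == "__main__":
    for section in (
        embedded_dmd("DMD1", "Chinese album", "Unknown"),
        referenced_dmd("DMD2", "http://example.org/records/42"),  # hypothetical URL
    ):
        print(ET.tostring(section, encoding="unicode"))
```

Swapping Dublin Core for MODS or the VRA Core would change only the namespace and the elements placed inside xmlData, which is the flexibility the namespace mechanism is meant to provide.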
The Library of Congress, motivated by its AV Prototyping project, created an XML schema encoding for Z39.87, called NISO Metadata for Images in XML Standard (MIX). The XML schema constitutes the smallest Russian doll in our series of nesting standards, as it may be plugged into the METS framework as an extension schema for technical metadata. The standard also proposes fields for the source and digital provenance sections of METS.
The NISO effort draws heavily on the Tagged Image File Format specifications, better known by their acronym TIFF. As the name implies, this format uses tags to define the characteristics of a digital file.[10] Image creation applications write the necessary parameters to the tags within the TIFF file, which means that the majority of the data Z39.87 covers already exists in file headers. To complete the metadata cycle, harvester utilities have to extract the information from the image file headers and import it into digital-asset-management systems for long-term preservation. By using the image file format specification as an integral part of the Data Dictionary, the standard leverages existing metadata to achieve cost savings.
On the other hand, in going beyond the TIFF specifications for some elements, the NISO standard acknowledges information outside the TIFF scope that plays an important role in digital preservation. From this vantage point, the Data Dictionary becomes an important tool for educating vendors about the metadata our community sees as invaluable to preserve our investment. RLG is investigating the formation of a group advocating among digital camera-back vendors for the cultural heritage community’s metadata needs.[11] An industry standard for consumer digital cameras called DIG35 already has broad support among vendors. DIG35 allows transfer of information from the camera to the software utility that consumers use to manage their holiday snapshots.
Building on that model, NISO Z39.87 in its XML instantiation MIX could become the file-exchange format to go between high-end scanners or camera backs and sophisticated asset-management databases. The Data Dictionary divides the technical metadata elements into four groups.
For any institution just starting out on the path of digital preservation, managing technical metadata through the NISO Data Dictionary is a great first step. The term data dictionary itself comes from the database community; it refers to a file defining the basic organization of a database down to its individual fields and field types. NISO Z39.87 represents a blueprint for a database or a database module that can be implemented fairly quickly—all the intellectual legwork has already been done by the standards committee. For expanding the database to include structural metadata relating files to each other, plus a descriptive record, as well as rights metadata, the database could be augmented by looking at METS and its extension schemas. Again, the Library of Congress AV Prototyping project offers a model implementation of a database using the METS approach. Scaling up to the bigger picture, this database could find its home in an archival environment specified by the OAIS.
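A harvester utility of the kind mentioned earlier could start from something like the following sketch. It reads only the first image file directory of a TIFF byte string and pulls out a handful of baseline tags, using tag numbers and field types from the TIFF 6.0 specification. The function name harvest_tiff_tags, the selection of tags, and the plain dictionary it returns are assumptions made for illustration; mapping the values to Z39.87 element names or to MIX is deliberately left out.

```python
import struct

# Baseline tag numbers from the TIFF 6.0 specification (a small subset).
TAG_NAMES = {
    256: "ImageWidth",
    257: "ImageLength",
    258: "BitsPerSample",
    259: "Compression",
    262: "PhotometricInterpretation",
    282: "XResolution",
    283: "YResolution",
    296: "ResolutionUnit",
}
# TIFF field type -> (struct format, size in bytes): SHORT, LONG, RATIONAL.
FIELD_TYPES = {3: ("H", 2), 4: ("I", 4), 5: ("II", 8)}


def harvest_tiff_tags(data):
    """Return {tag name: value} for the known tags in the first IFD of a TIFF."""
    order = {b"II": "<", b"MM": ">"}.get(data[:2])  # little- or big-endian
    if order is None or struct.unpack(order + "H", data[2:4])[0] != 42:
        raise ValueError("not a TIFF file")
    (ifd_offset,) = struct.unpack(order + "I", data[4:8])
    (entry_count,) = struct.unpack(order + "H", data[ifd_offset:ifd_offset + 2])

    harvested = {}
    for i in range(entry_count):
        start = ifd_offset + 2 + 12 * i  # each directory entry is 12 bytes
        tag, ftype, count = struct.unpack(order + "HHI", data[start:start + 8])
        if tag not in TAG_NAMES or ftype not in FIELD_TYPES or count != 1:
            continue  # this sketch skips multi-value and unknown tags
        code, size = FIELD_TYPES[ftype]
        if size <= 4:
            raw = data[start + 8:start + 8 + size]  # value is stored inline
        else:
            (offset,) = struct.unpack(order + "I", data[start + 8:start + 12])
            raw = data[offset:offset + size]  # value is stored elsewhere in the file
        if ftype == 5:  # RATIONAL: numerator and denominator
            numerator, denominator = struct.unpack(order + code, raw)
            value = numerator / denominator if denominator else None
        else:
            (value,) = struct.unpack(order + code, raw)
        harvested[TAG_NAMES[tag]] = value
    return harvested
```

In use, the bytes would come from reading an image file in binary mode, and the resulting dictionary could be written to one row of a technical metadata table. That is the sense in which the file header already holds most of what the Data Dictionary asks for.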
To summarize: as illustrated by figure 4, the OAIS stipulates information packages, which find instantiation in METS; METS stipulates an extension schema for technical metadata, which finds an instantiation in Z39.87’s XML schema, MIX. Now, after the detailed review, the first bulleted list in this article should make a lot more sense.
In broad strokes, digital preservation with the nesting standards OAIS, METS, and Z39.87 looks like a puzzle with all the pieces neatly falling into place. In the details, however, some harmonization issues between the standards remain. For example, the OAIS model breaks an information package into different subcomponents than the METS schema; the NISO Data Dictionary and its XML encoding MIX cover not only the technical metadata extension of METS, but also some elements that the digital object standard relegates to sections on source and digital provenance. Nevertheless, the convergence of three standards developed independently illustrates that a holistic view of digital preservation is emerging. Only widespread implementation will tell whether the theory as outlined by the standards can hold up in practice.
Footnotes
[2] National Information Standards Organization.
[10] For the full TIFF tag library, see Appendix A of the format specifications.
[11] For more information about this fledgling initiative, please contact the author.
Saving Digital Heritage: A UNESCO Campaign
Colin Webb
So begins an important new document being prepared for submission to the General Conference of UNESCO, the United Nations Educational, Scientific and Cultural Organisation. The Draft Charter on the Preservation of the Digital Heritage was positively received by a recent session of the UNESCO Executive Board, which asked for further consultations during preparation of a final draft for consideration. The Draft Charter is one very visible element in an international campaign to address the barriers to digital continuity and to head off the emergence of a second “digital divide,” in which the tools of digital preservation are restricted to the heritage of a well-resourced few. As well as the Charter, other elements of UNESCO’s strategy for promoting digital preservation include widespread consultations, the development of practical and technical guidelines, and a range of pilot projects. UNESCO has been critical in fostering the understanding and preservation of other kinds of heritage through avenues such as the World Heritage Convention and the Memory of the World program. Given the organisation’s commitment to the safeguarding of recorded knowledge evident in its Information for All program, it is not surprising that UNESCO has been concerned at the prospect of the loss of vast amounts of digital information. Digital technology’s immense potential for human benefit in so many areas—communication, expression, knowledge sharing, education, community building, accountability, to name just a few—is a tantalizing promise so easily denied by the lack of means, knowledge, or will to deal with its other great potential: rapid loss of access. The impetus for this campaign was embedded in a resolution passed by the UNESCO General Conference at its previous meeting in 2000. That resolution, drafted in part by the Council of Directors of National Libraries (CDNL), highlighted the need to safeguard endangered digital memory. 
Following that, as a basis for developing a UNESCO strategy, the European Commission on Preservation and Access (ECPA) was commissioned to prepare a discussion paper outlining the issues in digital preservation for debate.
Consultation Process
In addition to circulating the campaign's draft papers for comment to governments, nongovernment organisations, and experts all over the world, the campaign has featured a number of regional consultation meetings convened specifically to raise issues of regional concern and to provide comment on the Preliminary Draft Charter and Draft Guidelines on the Preservation of Digital Heritage. The meetings were held between November 2002 and March 2003, in Canberra, Australia (for Asia and the Pacific); in Managua, Nicaragua (for Latin America and the Caribbean); in Addis Ababa, Ethiopia (for Africa); in Riga, Latvia (for the Baltic states); and in Budapest, Hungary (for Eastern Europe). All the meetings confirmed the need for urgent action and the great distance to be traveled before preservation of digital heritage is a reality in most countries. In total, around 175 experts and stakeholders from eighty-six countries participated in the five meetings, representing libraries, records archives, museums, audiovisual archives, data archives, producers and publishers of digital content, lawyers, universities and academies, governments, standardization agencies, community development organisations, computer industries, and researchers, among others.
Draft Charter on the Preservation of the Digital Heritage
Charters and declarations promulgated by UNESCO are meant to be “normative” documents that member states agree to through a vote of acceptance rather than by individual ratification. They are not binding and do not require any specific action on the part of governments, but they do express aspirations and priorities.
In this case the purpose of the Draft Charter is to focus worldwide attention on the issues at stake and to encourage responsible preservation action wherever it can be taken. The Draft Charter explains that the digital heritage
The purpose of preserving this heritage is to ensure that it can be accessed. The Draft Charter recognizes that this involves a tension and seeks a “fair balance between the legitimate rights of creators and other rights holders and the interests of the public to access digital heritage materials” in line with existing international agreements. It recognizes that some digital information is sensitive or of a personal nature and that some restrictions on access and on opportunities to tamper with information are necessary. Sensibly, it asserts the responsibility of each member state to work with “relevant organisations and institutions in encouraging a legal and practical environment which would maximise accessibility of the digital heritage.” Threats to this digital heritage are highlighted, including rapid obsolescence of the technologies for access, an absence of legislation that fosters preservation, and international uncertainties about resources, responsibilities, and methods. Urgent action is called for, ranging from awareness raising and advocacy to practical programs that address preservation threats throughout the digital life cycle.
Many agencies have a role to play, both within and outside governments. Agencies are urged to work together to pursue the best possible results and to democratize access to digital preservation methods and tools. The Draft Charter proposes a UNESCO commitment to foster cooperation, build capacity, and establish standards and practices that will help. Although this document is meant to inspire rather than dictate action, its adoption by UNESCO will be an important opportunity to raise digital preservation issues with governments and others who can influence how laws, budgets, and expectations are framed to help or hinder continuity of the digital heritage.
The guidelines address at least four kinds of readers with different but overlapping needs:
The structure of the guidelines is intended to make it easy for readers to find the information most relevant to their needs. The regional consultation process highlighted the fact that many people who feel they have a preservation responsibility are operating with very limited resources. Specific suggestions have been included to provide some starting points, although comprehensive, reliable digital preservation is a resource-intensive business. Material in the Guidelines is organized around two approaches: basic concepts behind digital preservation (explaining concepts of digital heritage, digital preservation, preservation programs, responsibility, management, and cooperation) and more detailed discussion of processes and decisions involved in various stages of the digital life cycle, including deciding what to keep, working with producers, taking control and documenting digital objects, managing rights, protecting data, and maintaining accessibility.
For some readers the level of technical detail will be disappointing. The detail required to meet all the needs of practitioners is very situation-specific and quickly dated. As the Guidelines are intended to be useful in a very wide range of sectors and circumstances, the emphasis is on technical and practical principles that should enable practical decisions. It is to be hoped that UNESCO will complement the Guidelines with a Web site offering a growing body of technical details and tips aimed at specific sectors. To give readers a sense of the approaches taken, a few of the principles asserted in the Guidelines are appended to this paper. The UNESCO Guidelines for the Preservation of Digital Heritage will be published in a number of languages. At the time of writing, they are available in English from the UNESCO Web site.
Highlighted Web Site
FAQ
Squeezing More Life Out of Bitonal Files: A Study of Black and White. Part III.
Your editor's interview in the December 2002 RLG DigiNews states that JPEG 2000 can save space and replace the multitude of file formats used for conversion and display of cultural heritage images but that it isn't suitable for bitonal material. We have lots of bitonal images. Is there anything similar available for them?
Part I of this three-part FAQ discussed general considerations for migration of scanned bitonal images away from TIFF G4, while Part II examined the characteristics of several alternative bitonal file formats and compression schemes that have become available during the past decade. In this, the final installment, we present the results of our experiences with several products for converting individual and multipage bitonal high-resolution TIFF G4s. Our coverage includes product specifications, general impressions, compression data, and sample images. Please note that some of the files require special plug-ins to be viewed. Instructions for downloading the necessary viewers are given below.
Test Image Selection
Though a bitonal image may seem like a simple affair, how well a particular image compresses depends on how it was scanned, the nature of its content, and the design of the compression scheme. Characteristics of the source image that can affect the rate of compression include:
Why do these factors affect compression? It helps to understand a little about how image compression is accomplished. Lossless compression depends on the recognition of patterns and the replacement of repeated elements with compact representations that exactly describe the feature being compressed. For example, instead of storing every bit in a scan line of all white bits, simply store a count of the white bits. Thus, sparse printing that leaves a lot of white space compresses well, while dense printing or highly speckled pages result in more transitions between black and white and thus less efficient compression. The more sophisticated compression algorithms tested here take advantage of the fact that higher level elements are repeated within printed documents, including the symbols that make up the text. Thus, if a 12-point, Times Roman, non-bold, non-italic, non-underlined, lowercase 'a' appears in a document, its bitmap can be stored in a database and a subsequent appearance of the identical character can be replaced by a pointer to the database. This explains why clean, uniform typography compresses better than irregular, highly variant typography. Longer documents have an advantage because the algorithm "learns" more and more of the characters as it processes the text. Halftones deserve a special mention. Bitonal halftoning is a printing process that simulates shades of gray by varying the size and spacing of black dots. Avoiding problems such as moiré (interference patterns) and poor contrast when scanning halftones bitonally requires the use of special processing algorithms (e.g. dithering or error diffusion). When done properly, the typical scanned halftone will be densely packed with data of a somewhat random nature, presenting a real challenge to compression algorithms. 
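The two lossless ideas described above, counting runs of identical pixels and replacing repeated symbols with pointers, can be sketched in a few lines. This is a toy illustration under stated assumptions, not the algorithm used by TIFF G4, JB2, or JBIG2: real encoders work in two dimensions, entropy-code their output, and may match symbols that are only approximately alike. The function names and the representation of pixels as lists of 0 (white) and 1 (black) are assumptions made for the example.

```python
def run_lengths(scan_line):
    """Encode a row of 0/1 pixels as alternating run lengths, white run first."""
    runs, current, length = [], 0, 0
    for pixel in scan_line:
        if pixel == current:
            length += 1
        else:
            runs.append(length)  # close the run; a line starting black yields a 0
            current, length = pixel, 1
    runs.append(length)
    return runs


def expand_runs(runs):
    """Rebuild the exact scan line from its run lengths (lossless round trip)."""
    pixels, color = [], 0
    for length in runs:
        pixels.extend([color] * length)
        color = 1 - color
    return pixels


def encode_symbols(glyphs):
    """Store each distinct glyph bitmap once and replace repeats with pointers."""
    dictionary, pointers, index_of = [], [], {}
    for glyph in glyphs:
        key = tuple(tuple(row) for row in glyph)  # exact match only
        if key not in index_of:
            index_of[key] = len(dictionary)
            dictionary.append(glyph)
        pointers.append(index_of[key])
    return dictionary, pointers


if __name__ == "__main__":
    blank = [0] * 2550            # an all-white scan line
    speckled = [0, 1] * 1275      # alternating pixels, the worst case
    print(len(run_lengths(blank)), "run for the blank line")
    print(len(run_lengths(speckled)), "runs for the speckled line")
```

The blank line collapses to a single count while the speckled line needs one count per pixel, which is why noise and dense halftones compress poorly. The symbol dictionary grows only when a new glyph shape appears, so longer, cleanly printed documents amortize it over more text.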
Lossy but "visually lossless" compression attempts to remove elements that are redundant for human visual perception, producing an image that contains less information, but doesn't appear degraded.
We selected four images for in-depth testing, representing a variety of content types. We also tested 20-page sequences derived from the same works in order to average out anomalies, and give the compression algorithms a chance to show off their "learning curves." The images are from three of Cornell's older collections: historic math books, NEH agriculture, and historic monographs. All images are bitonal 600 dpi TIFF G4s. If you follow the links for the individual pages from Table 1, you'll be taken to the image as it appears within the Cornell Digital Library: converted from TIFF to GIF, scaled down by a factor of six, and enhanced with gray for improved legibility. The links for the 20-page groupings will bring up all 20 pages in GIF thumbnail mode, from which larger GIFs of the individual pages can then be accessed.
Table 1. Details of Test Images
Conversion software selection
Our software testing was limited to products that are open source or for which free evaluation copies are available. In some areas of computing, that might greatly constrain the selections, but in the specialized market niche of bitonal image conversion, it hardly cramped our style at all. We were able to test most of the important packages without spending a dime on software acquisition, which bodes well for anyone who wants to test these products on their own image collections.
As indicated in part II, we focused testing on products supporting three main technologies:
CPC: We fully tested CPC Tool from Cartesian Products, the only product available to encode this proprietary format.
DjVu: a file format supporting several compression schemes for bitonal, gray level, and color images. It supports both lossy and lossless bitonal compression. The bitonal compression algorithm is called JB2 and is similar to JBIG2. We fully tested Any2DjVu, a Web service that allows files of many different formats (including TIFF G4) to be uploaded and converted to the DjVu format. We also tested cjb2, a bitonal DjVu encoder that is part of the DjVuLibre package, an open source implementation of DjVu. Cjb2 only converts single pages. Although DjVuLibre comes with a utility (called djvm) that combines single DjVu pages into multipage DjVu files, it does not support font learning across pages. Thus we tested cjb2 only for the encoding of single pages. There is also a commercial DjVu encoder, made by the format's owner, LizardTech, Inc., which we did not test. Part of LizardTech's Document Express 4.0, it is available in a trial version from LizardTech's Web site. The trial became available fairly late in our testing cycle and requires a special page cartridge (allowing the encoding of 250 pages), which we requested but still had not received ten days later.
JBIG2: a lossless and lossy compression scheme for bitonal images only. JBIG2 does not specify a file format, but is often associated with PDF. We fully tested two JBIG2 in PDF encoders, PdfCompressor from CVision Technologies and SILX from PARC (Palo Alto Research Center). Another option for JBIG2 that we did not test is Adobe's Acrobat Capture with Compression PDF Agent.
Tables 2 and 3 provide additional details on the products tested.
Table 2. Product information (general)
How we tested
For each tool, we converted the four individual test pages from TIFF G4 to the supported target format. In the case of cjb2, the open source bitonal DjVu encoder, we first had to convert the TIFF G4s to pbm (portable bitmap) format, which we did with the free Windows application IrfanView. We also converted 20-page groupings derived from the same works as the individual test pages, except for cjb2, which only handles single pages. Other than PdfCompressor, which runs only under Windows (CVision says the product will eventually support Solaris), all the tested products can be run under Windows, Linux or Unix. We ran the Windows version of cjb2, and the Solaris versions of CPC Tool and SILX. However, results should be the same regardless of the platform on which the conversions are carried out. Each product offers options that affect the speed of conversion, display speed,