Contents of: Volume 8, Number 5 ISSN 1093-5371
  Special Focus Issue: Introduction  
  Feature Article 1: Capturing Technical Metadata for Digital Still Images  
  Feature Article 2: PREMIS - Preservation Metadata - Implementation Strategies Update 1. Implementing Preservation Repositories for Digital Materials: Current Practice and Emerging Trends in the Cultural Heritage Community  
  Feature Article 3: Editors' Interview with Günter Mühlberger and Claus Gravenhorst of METAe  
  Highlighted Web Site: First Monday  
  Spotlight: Technical Metadata Harvesters  
  Calendar of Events  
  Announcements  
  Publishing Information  
Special Focus Issue

Introduction

Author: Robin L. Dale - RLG (Robin_Dale@notes.rlg.org)

In the decade plus that cultural heritage institutions have been digitizing portions of their collections to provide enhanced access to an ever-growing community of users, the financial investment by cultural heritage institutions and supporting agencies has been huge. To date, the Library of Congress has digitized approximately two hundred thousand items in all media (books, manuscripts, photographs, films, sound recordings, newspapers, etc.).[i] Between 1998 and 2003, the Institute for Museum and Library Services (IMLS) leadership grant programs funded over 123 projects, creating more than two million digital objects in all media and specifically over 675,000 images and more than 1.3 million pages of text![ii] In the UK, a variety of organizations including the Joint Information Systems Committee, the Heritage Lottery Fund, the New Opportunities Fund, and the Higher Education Funding Council for England have provided funding for the digitization of cultural resources. And after the digitization of millions of cultural objects, a lesson has emerged: with the high cost of the digitization process and the stress it puts on sensitive materials, only access in perpetuity can justify the cost of digitizing collections.

Though early projects tended to digitize and document to a minimalist level, most cultural heritage institutions have now adopted a more mature approach of creating rich digital masters that will enable a multitude of uses over time. The digital images that museums, libraries and archives are creating have quickly become digital assets: investments they need to manage and preserve, just as they need to manage and preserve their physical collections. But in the digital environment, the ability to manage and preserve information over time will depend on metadata, both the kind of metadata and the level of detail collected. Until recently, collecting such metadata has been a potentially expensive proposition, one to which most institutions have been reluctant to commit. Fortunately, a metadata framework for preserving our digital cultural heritage is emerging through the work of several international working groups, and projects and tools are being developed to make metadata capture and collection more economical.

To highlight these issues that affect both the digitization and digital preservation of cultural heritage materials, this issue of RLG DigiNews will be devoted to the theme of preservation metadata.


[i] These figures were obtained from the Library of Congress Ask a Librarian service on 4 October 2004.

[ii] These are rough statistics supplied by Sarah Shreeves (University of Illinois at Urbana-Champaign), project coordinator of the metadata repository for IMLS digitization projects. Statistics are based upon a subset of IMLS grant projects thus far and granularity can be an issue in the way some digital objects have been reported. What is clear, however, is that statistics will be far higher when all inventories and data collections are complete.

Feature Article 1

Capturing Technical Metadata for Digital Still Images

Authors: Robin L. Dale - RLG (Robin_Dale@notes.rlg.org), Günter Waibel - RLG (Guenter_Waibel@notes.rlg.org)

Introduction

Cultural heritage institutions have been actively digitizing their collections for more than ten years, and in that time countless standards – both de facto and de jure – have been created to facilitate enhanced, long-term access to these emerging digital collections. And while the development of technical specifications for digitization has been key, some of the most influential developments have been in the area of metadata.

Early on, the need for better methods of resource discovery sparked the formation of what is now the Dublin Core Metadata Initiative and standardization for resource discovery metadata began. It wasn’t until 1999 that questions about technical metadata began to emerge from cultural heritage institutions. Though most cultural heritage institutions had begun to digitize their collections at a rapid pace, few were consistently collecting metadata that would enable them to maintain the functionality and quality intrinsic to the images despite whatever preservation strategies might be applied over the long-term. Institutions attributed the problem to a lack of knowledge or standards specifying technical metadata for digital images.

In April 1999, the National Information Standards Organization (NISO), the Council on Library and Information Resources (CLIR), and RLG sponsored a workshop to examine the technical information needed to manage and use digital still images that reproduce a variety of pictures, documents, and artifacts. An outgrowth of that workshop was the development of NISO Z39.87 Technical Metadata for Digital Still Images, a data dictionary defining a standard set of technical metadata elements that would allow users to develop, interpret, and manage digital images for the long term.

Technical Metadata for Images and NISO Z39.87

Why is technical metadata so important? Although technical metadata is only a subset of the complete suite of preservation metadata necessary to achieve the long-term viability of a digital asset, it has often been called the first line of defense against losing access. Technical metadata assures that the information content of a digital file can be resurrected even if traditional viewing applications associated with the file have vanished. Furthermore, it provides metrics that allow machines, as well as humans, to evaluate the accuracy of output from a digital file. In its entirety, technical metadata supports the management and preservation of digital images throughout the different stages of their life-cycles.

Technical metadata is necessary to support two fundamental functions: documentation of image provenance and history (production metadata); and assurance that image data would be rendered accurately on output (to screen, print, or film). Ongoing management, or “preservation,” of these core functions would require the development of applications to validate, process, refresh, and migrate image data against criteria encoded as technical metadata.

The NISO Z39.87 data dictionary covers four distinct categories of functions:

  • Basic image parameters record information crucial to displaying a viewable image.
  • Image creation metadata records information crucial to understanding the technical environment in which a digital image file was captured.
  • Imaging performance assessment metadata records information that allows evaluation of the digital image’s quality, or output accuracy.
  • Change history metadata records information about the processes applied to an image over its life cycle.
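The four categories above can be pictured as four groups of elements attached to one image. The following is only a minimal sketch of that grouping; the class name, field names, and dictionary keys are hypothetical illustrations and are not element names taken from Z39.87.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TechnicalMetadataRecord:
    """Groups technical metadata under the four categories listed above.

    All keys used in the dictionaries are illustrative placeholders,
    not official Z39.87 element names.
    """
    basic_image_parameters: Dict[str, str] = field(default_factory=dict)
    image_creation: Dict[str, str] = field(default_factory=dict)
    imaging_performance_assessment: Dict[str, str] = field(default_factory=dict)
    change_history: Dict[str, str] = field(default_factory=dict)

    def empty_categories(self) -> List[str]:
        """Return the names of categories with no recorded elements."""
        return [name for name, values in vars(self).items() if not values]


# A record in which only the first two categories were captured.
record = TechnicalMetadataRecord(
    basic_image_parameters={"format": "image/tiff", "width": "4000", "height": "3000"},
    image_creation={"capture_device": "flatbed scanner"},
)
print(record.empty_categories())
```

A check like `empty_categories` is one simple way an institution might spot records that document how an image is displayed but say nothing about its quality or processing history.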

NISO Z39.87 has been available for use since 2002 under the status of Draft Standard for Trial Use (DSFTU). The data elements in the DSFTU version relied heavily upon information found in TIFF files and while useful, this convention has become somewhat limiting with the advent of file formats such as JPEG2000.

Currently, Z39.87 is undergoing significant revision so that it can better and more accurately document the range of file formats that institutions are collecting and managing. Data elements in the revised version build and expand on the DSFTU version, including technical metadata available in TIFF, TIFF/EP, JPEG, and JPEG2000 file formats, as well as metadata elements from the Digital Imaging Group’s1 DIG35 metadata element set and the EXIF specification.

Although Z39.87 itself was designed to be agnostic in terms of implementation, the NISO Metadata for Images in XML Schema (MIX), commissioned by NISO and created by the Library of Congress, has been the dominant form of use for the data dictionary. Because MIX is a Metadata Encoding and Transmission Standard (METS) extension schema, implementation and use of the data dictionary on a local level has been fairly easy to manage. Surveys have also shown that Z39.87 has informed the creation of local metadata element sets, including contributing to the formation of broader preservation metadata element sets.2

The OCLC-RLG Preservation Metadata: Implementation Strategies (PREMIS) working group is in the process of creating a comprehensive data dictionary for preservation metadata independent of file format. Since technical metadata is file-format specific, the revised version of Z39.87 will complement the all-encompassing PREMIS effort for data sets comprised of digital images. However, putting digital preservation into practice still requires economic ways of data capture.

Collecting Technical Metadata to Support Preservation

Despite the importance of technical metadata, community action to gather the necessary metadata has been slow to come. There are two related reasons for this: the inability of many capture devices to record some of the technical metadata desired; and the largely manual process that most institutions have been relying upon to gather and document metadata.

Most of the information available about capture devices and metadata recording is related to digital cameras and comes from product reviews or recent surveys conducted by Kodak.3 From these observations and reviews, it is clear that the full range of metadata from any of the related standards is underutilized. At best, most cameras – consumer and professional levels – capture some core TIFF elements, very few of the available EXIF “camera capture” elements, and a few additional elements categorized as “GPS tags” and “Thumbnail tags.” Surprisingly, however, current cameras labeled as “consumer cameras” were likely to record more information than the “professional cameras” offered by the same company. Recent conversations with members of the I3A/IT10 (Electronic Still Picture Imaging) standards committee reveal that the upcoming adoption of JPEG2000 (and in particular, the JPX file type) by some digital camera manufacturers promises to allow greater technical metadata capture, but this prospect may apply only to manufacturers willing to offer JPEG2000 as an optional output file format on their devices. More work must be done to ensure that device manufacturers are recording and making available the metadata and file formats needed by the cultural heritage community.

The actual collection of metadata and the detail to which an institution documents its digital collections is also a significant problem. As a generally labor-intensive activity, institutions have routinely collected and documented minimal metadata to reduce the overall cost of creating and storing the collection. But this cost-cutting measure is potentially short-sighted. Will an institution have the capability to render its digital files over time? Or more critically, will an institution have enough information to perform appropriate preservation measures and keep information viable? Do we really know how much metadata we will need in order to preserve image files for the long-term?

The answers to these questions are not yet known. It is unlikely that we will soon know exactly how much metadata is really needed to support future access and management of digital images though we know that current practice of minimal metadata collection is unlikely to be enough. Several institutions have begun to perform preservation actions on certain image files and whispers are beginning to be heard regarding the necessity of enough appropriate metadata to perform the tasks at hand. But how can an institution acquire enough of the correct technical metadata to hedge bets over time and facilitate the economic creation of digital collections? (Even Z39.87 is comprised of approximately 125 metadata elements and approximately 40 of those elements are mandatory.) The only realistic and feasible answer is to automate metadata collection and extraction to the extent possible.
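To make the idea of automated extraction concrete, here is a minimal sketch, in Python, of reading a few baseline tags directly from a TIFF header. It is not one of the tools discussed in this issue. The function names and the in-memory sample file are assumptions made for illustration; the tag numbers (256, 257, 258, 259, 262) are the baseline TIFF tags for width, length, bits per sample, compression, and photometric interpretation. The sketch handles only single-valued SHORT and LONG fields in the first image file directory.

```python
import struct

# Baseline TIFF tag numbers mapped to readable names.
TAG_NAMES = {
    256: "ImageWidth",
    257: "ImageLength",
    258: "BitsPerSample",
    259: "Compression",
    262: "PhotometricInterpretation",
}


def read_tiff_tags(data: bytes) -> dict:
    """Read single-valued SHORT/LONG tags from the first IFD of a TIFF."""
    if data[:2] == b"II":
        endian = "<"          # little-endian byte order
    elif data[:2] == b"MM":
        endian = ">"          # big-endian byte order
    else:
        raise ValueError("missing TIFF byte-order mark")
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("missing TIFF magic number 42")
    (entry_count,) = struct.unpack(endian + "H", data[ifd_offset:ifd_offset + 2])
    tags = {}
    for i in range(entry_count):
        start = ifd_offset + 2 + 12 * i          # each IFD entry is 12 bytes
        tag, field_type, count = struct.unpack(endian + "HHI", data[start:start + 8])
        raw = data[start + 8:start + 12]         # value (or offset) field
        if count != 1:
            continue                             # multi-valued fields are skipped
        if field_type == 3:                      # SHORT, left-justified in the field
            (value,) = struct.unpack(endian + "H", raw[:2])
        elif field_type == 4:                    # LONG
            (value,) = struct.unpack(endian + "I", raw)
        else:
            continue                             # other types are skipped
        tags[TAG_NAMES.get(tag, tag)] = value
    return tags


def build_sample_tiff() -> bytes:
    """Build a header-only, little-endian TIFF for demonstration (no pixel data)."""
    entries = [(256, 4, 4000), (257, 4, 3000), (258, 3, 8), (259, 3, 1), (262, 3, 1)]
    header = struct.pack("<2sHI", b"II", 42, 8)  # first IFD starts at byte 8
    ifd = struct.pack("<H", len(entries))
    for tag, field_type, value in entries:
        if field_type == 3:
            ifd += struct.pack("<HHIHH", tag, field_type, 1, value, 0)
        else:
            ifd += struct.pack("<HHII", tag, field_type, 1, value)
    ifd += struct.pack("<I", 0)                  # no further IFDs
    return header + ifd


extracted = read_tiff_tags(build_sample_tiff())
print(extracted)
```

Values recovered this way cost nothing in staff time, which is the economic argument for automation; anything the capture device never wrote into the file still has to come from another source.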

Automating Technical Metadata Collection: The Automatic Exposure Initiative

The first step toward automating technical metadata collection is the identification of a target metadata element set. The Z39.87 element set fills that role as a recognized, community-created, soon-to-be international standard.4 The accompanying MIX schema serves as a placeholder for institutions to record the information, especially within METS. Problem one solved. Yet until recently, two problems continued to impede progress on automated metadata extraction: the inability of capture devices to record the range of technical metadata required to support long-term preservation and management; and the inability to easily expose and capture the metadata that does exist in digital image files. To address both of the remaining problems, RLG formed the Automatic Exposure initiative.

The Automatic Exposure initiative helps institutions meet the technical metadata imperative by pursuing a variety of implementation strategies. The initiative engages manufacturers of high-end scanners and digital cameras in a dialog about how their products can automatically capture technical metadata and make it available for transfer into digital repositories and asset management systems. Furthermore, it identifies existing or emerging technologies for harvesting technical metadata developed at individual institutions or by the industry, and explores how those tools could be leveraged to serve the entire community. NISO as the custodial home for NISO Z39.87 co-sponsors the initiative, and the Digital Library Federation (DLF) and the Museum Computer Network (MCN) pledged their support from the outset.

In the first phase of Automatic Exposure, RLG distributed an informal survey in June 2003 to identify stakeholders, current practices and common digital capture devices in the community. Despite limited circulation, well over 100 responses were received. In summary, the responses verified that capturing technical metadata tends to be a manual, time-consuming process. All of the responding institutions wholeheartedly subscribed to the value of recording technical metadata, yet only a minority had the ability to capture the technical properties of files even at the most basic level.

The survey responses drove subsequent activities, including a white paper outlining the problems, solutions, and opportunities to be investigated as a part of the Automatic Exposure initiative. Further, the survey identified the prevalent digital capture devices used in the cultural heritage community, thereby providing a “shortlist” of manufacturers with which to work over the course of the initiative. Finally, the white paper identified a number of software programs that have been developed by or for cultural heritage institutions to help them expose and export technical metadata for image files. The compiled list represents significant work of the community to create a suite of tools necessary to support digital preservation. More importantly, most of these tools have been developed using the open source model and are available for use by other cultural heritage institutions.

New Dialog with Device Manufacturers

In the dialog with manufacturers, the project has aimed to find common interests in recording metadata and making it available for further processing, such as ingest into preservation systems. While the cultural heritage community has defined a standard metadata element set for digital preservation in Z39.87, the industry has launched a number of initiatives that promise to deliver self-describing digital files, or files that carry within their code vital information about their origination, content, access rights, etc. In some instances, these initiatives propose metadata element sets that include tags relevant to digital preservation (such as DIG35 or EXIF); in other instances, they propose specific or generic transfer mechanisms for self-describing metadata (such as the XML box in JPEG 2000’s JP2 and JPX file types or Adobe’s eXtensible Metadata Platform, Adobe XMP). The industry at large and the manufacturers of digital capture devices have already made an investment by developing and implementing some of these technologies, though a review of most industry initiatives revealed that none of the existing specifications delivers the complete metadata set crucial for digital preservation as outlined in NISO Z39.87.5

Over the course of the initiative, many device manufacturers have responded positively to the invitation to participate, among them Betterlight, Creo/Leaf, HP, Kirtas Technologies, Kodak, Sinar Bron, and PhaseOne. Assistance from experts such as Franziska Frey (Rochester Institute of Technology) and Don Williams (Eastman Kodak) has been instrumental in further connecting this community-based effort with industry. Although no promises have been made, each of the above-named device manufacturers has expressed interest in responding to the needs of this community. Future hardware and software developments from these manufacturers will tell the story, and cultural heritage institutions are urged to contact device manufacturer representatives and emphasize their needs.

Looking to the Future: New File Formats and New Tools

Recently, several device and imaging software manufacturers have announced plans to develop new, “archival” file formats such as the newly introduced Adobe Digital Negative (DNG) raw file format or those that will be created through the new Picture Archiving and Sharing Standard (PASS) group. We hope that this convergence in interests presents us with the opportunity to work with the new initiatives so that future file formats can supply a complete Z39.87 technical metadata set.

At the same time, the community cannot afford to wait for a “magic bullet” file format because it is unlikely to come. Instead, institutions should become familiar with existing tools that will assist with metadata exposure and extraction. The Spotlight below contains a list of metadata harvesting tools that are available for use now. In addition, RLG will be releasing two new tools as part of the Automatic Exposure initiative. The first tool, the “Automatic Exposure Scorecards,” will profile and review the available technologies for capturing technical metadata. Scorecards will be available on the Automatic Exposure web site in the coming months. A second tool under development is a Z39.87-Adobe Extensible Metadata Platform (XMP) panel to allow the extension of the metadata handling capabilities of Adobe Photoshop, a commonly used software package in the cultural heritage digitization process. When completed, this tool will be announced in a future issue of RLG DigiNews and will be freely available on the RLG web site.

1 The Digital Imaging Group (DIG) merged with the Photographic and Imaging Manufacturers Association (PIMA) to form the International Imaging Industry Association (I3A) in July 2001.

2 Both the Automatic Exposure survey and the survey conducted by the Implementation subgroup of the PREMIS working group found this alternate use of the Z39.87 data dictionary.

3 Automatic Exposure: Capturing Technical Metadata for Digital Still Images. Mountain View, CA: RLG, 2004. See Appendix 3: Kodak’s Professional Camera Metadata Survey (2002), Appendix 4: Kodak’s 2002 Consumer Digital Camera Metadata Survey, and Appendix 5: Kodak’s 2003 Consumer Digital Camera Metadata Survey.

4 Though the NISO acronym properly translates to the National Information Standards Organization, the standards this organization creates are largely those utilized in the networked environment. The restriction to “national” in this sense is untrue. In fact, the standards this organization creates and supports have international support and impact.

5 Though none of the existing specifications delivers the complete technical metadata set as outlined in NISO Z39.87, all of them cover at least some of the data elements specified there.


Feature Article 2

PREMIS - Preservation Metadata - Implementation Strategies Update 1. Implementing Preservation Repositories for Digital Materials: Current Practice and Emerging Trends in the Cultural Heritage Community

Author: Priscilla Caplan - Florida Center for Library Automation (FCLA) (pcaplan@ufl.edu)

Editors’ Note: The work that PREMIS is doing is of great importance and interest for the digital preservation community. This is the first update on their activities with more to follow in future issues.

In 2003 OCLC and RLG established an international working group to develop a common, implementable core set of metadata elements for digital preservation. Most published specifications for preservation-related metadata are either implementation-specific or broadly theoretical. PREMIS (Preservation Metadata: Implementation Strategies) was charged to define a set of semantic units that are implementation-independent, practically-oriented, and likely to be needed by most preservation repositories. The group was also charged to examine alternative strategies for the encoding, storage, and management of preservation metadata within a digital preservation system, to investigate the exchange of preservation metadata between systems, and to explore opportunities for the cooperative creation and sharing of preservation metadata.

To aid in this work, the group conducted a survey of cultural heritage institutions using, developing, or planning development of a preservation repository. They asked repositories questions about their mission and services; their models and policies; their architecture, storage and preservation strategies; and their metadata practices. The survey report, Implementing Preservation Repositories For Digital Materials: Current Practice And Emerging Trends In The Cultural Heritage Community, was recently published on the PREMIS website.

The report summarizes the practices of forty-eight respondents in thirteen different countries. Twenty-eight libraries, seven archives, three museums, and eleven other types of institution replied to the survey. Sixteen of the respondents were also interviewed by telephone. The survey was distributed in November 2003 and responses were accepted until March 2004, so the data describes the situation in the winter of 2003/2004.

Institution | Count | Percentage
Libraries | 28 | 58%
Archives | 7 | 15%
Other | 14 | 29%
Table 1. Type of Institution, PREMIS Report, p.12
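The percentages in Table 1 are consistent with each count being divided by the forty-eight respondents and rounded to a whole percent. That reading is an inference from the figures, not something the report excerpt states, and the sketch below simply checks it. The function name is hypothetical, and halves are rounded up using integer arithmetic.

```python
RESPONDENTS = 48  # number of survey respondents stated in the text


def percent_of_respondents(count: int, total: int = RESPONDENTS) -> int:
    """Whole-number percentage of respondents, with halves rounded up."""
    return (200 * count + total) // (2 * total)


table_1 = {"Libraries": 28, "Archives": 7, "Other": 14}
shares = {kind: percent_of_respondents(n) for kind, n in table_1.items()}
count_total = sum(table_1.values())

print(shares)
print(count_total)
```

The counts add up to 49 rather than 48, so the percentages add up to slightly more than 100. The excerpt does not explain the difference; one possible explanation is that a respondent was counted under more than one type.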

The report shows that at the time the survey was taken, there was relatively little experience with digital preservation. Only eleven institutions, none of them academic libraries, claimed to have implemented an active preservation strategy such as forward migration, normalization, or emulation. Twice that many respondents, however, reported having a preservation repository system at least partially in production.

Preservation Repository State | Count | Percentage
Planning and Organizational stage | 18 | 38%
Development (alpha, beta) | 16 | 33%
Production | 22 | 46%
Table 2. Stage of digital preservation development, PREMIS Report, p. 13

Metadata practices showed some high-level consistency. Nearly all respondents recorded metadata in multiple categories: rights and permissions, provenance, technical metadata, administrative and management information, descriptive metadata, and structural metadata.

Categories | Count | Percentage
Rights and permissions | 37 | 77%
Provenance (document history) | 40 | 83%
Technical metadata | 41 | 85%
Administrative and management information | 41 | 85%
Bibliographic/descriptive | 38 | 79%
Structural metadata | 36 | 75%
Other | 9 | 19%
Table 3. Categories of recorded metadata, PREMIS Report, p. 44

More than half the respondents used METS, and a quarter used the NISO/AIIM Technical Metadata for Digital Still Images specification. At the same time, there was not much evidence of consistency of practice at a more detailed level. Most respondents used multiple metadata schemes, and/or created local schemes selecting elements from other published schemes. Thirty-three different metadata schemes were mentioned, not including purely local inventions.

Schema | Count | Percentage
AUDIOMD: Audio Technical Metadata Extension Schema | 3 | 6%
CEDARS | 7 | 15%
Creative Commons Metadata | 5 | 10%
METS | 26 | 54%
MIX or Z39.87 | 12 | 25%
MPEG7 | 1 | 2%
MPEG21 | 4 | 8%
NEDLIB | 7 | 15%
National Library of Australia | 6 | 13%
National Library of New Zealand | 7 | 15%
OCLC Digital Archive Metadata | 11 | 23%
TEXTMD: Schema for technical metadata for text | 7 | 15%
Schema for rights declaration (METSRights.xsd) | 5 | 10%
VERS | 2 | 4%
VIDEOMD: Video Technical Metadata Extension Schema | 3 | 6%
Other | 22 | 46%
Table 4. Utilization of metadata schemes, PREMIS Report, p. 45

In general, the survey shows a picture of a community trying to take advantage of prior work but not at the point of developing or settling on dominant standards. The need for additional standards was clear, especially for technical metadata pertaining to various file formats.

Three-quarters of all repositories obtained metadata from their depositors and the same number extracted metadata automatically by program. Nearly two-thirds of the respondents also had some metadata supplied by repository staff, either through manual data entry or by automatic derivation from bibliographic databases. Automatic extraction by repository software was most often limited to technical metadata – size, file format, and file characteristics stored in file headers. Most respondents indicated they hoped or expected that metadata creation would be automated as fully as possible in the future.
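The kind of automatic extraction described here, size and file format read from the opening bytes of a file, can be sketched in a few lines. This is an illustrative example only and does not reproduce any respondent's software. The function name and returned keys are assumptions; the byte signatures are the well-known opening bytes of TIFF, JPEG, and JPEG 2000 (JP2 family) files.

```python
# Leading byte signatures for formats mentioned in this issue.
SIGNATURES = [
    (b"II*\x00", "TIFF (little-endian)"),
    (b"MM\x00*", "TIFF (big-endian)"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"\x00\x00\x00\x0cjP  \r\n\x87\n", "JPEG 2000 (JP2 family)"),
]


def describe_file_header(data: bytes) -> dict:
    """Report the size of a byte stream and the format its signature suggests."""
    detected = "unknown"
    for signature, name in SIGNATURES:
        if data.startswith(signature):
            detected = name
            break
    return {"size_bytes": len(data), "format": detected}


# Fourteen bytes that begin like a JPEG file, followed by filler.
sample = b"\xff\xd8\xff\xe0" + b"\x00" * 10
summary = describe_file_header(sample)
print(summary)
```

A signature match only suggests a format; it does not validate the file. That limitation is one reason respondents reported that automatic extraction was mostly confined to basic technical properties.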




Method of Obtaining | Libraries | Archives | Other | Total | Percentage
Supplied by depositor | 22 | 6 | 8 | 36 | 75%
Extracted by program | 20 | 6 | 10 | 36 | 75%
Supplied by repository staff | 16 | 6 | 8 | 30 | 63%
Other | | | | 34 | 71%
Table 5. Methods of obtaining metadata, PREMIS Report, p. 49

Most respondents recorded information about collections, logical objects, files, bitstreams, and metadata itself. Responses showed that respondents understood these distinctions and recorded metadata appropriate to each type of entity. Respondents stored metadata in a variety of ways: in relational databases, in XML databases, in object-oriented databases, in proprietary databases, in flat files, and bundled with related content files. Most respondents (60%) stored metadata in two or more ways, most commonly in relational tables and bundled with content.

Storing Method | Count | Percentage
In a relational database | 33 | 69%
Bundled with related content files | 22 | 46%
In an XML database | 13 | 27%
In a proprietary database or format | 11 | 23%
In flat files | 9 | 19%
In an object-oriented database | 2 | 4%
Table 6. Metadata storage methods, PREMIS Report, p. 51

Fewer than half of the respondents had adopted an approach to controlling normalized or migrated versions of archived materials. Respondents recognized the importance of recording relationships among files stored in the repository (e.g., access and preservation copies, versions, derivatives) but these relationships were recorded using many different mechanisms, including metadata, directory structures, identifiers, and file naming conventions.

The survey report is only the first PREMIS deliverable. The group is also working on a supplementary report based on a facet analysis of the survey data. The major product, of course, will be the core element set itself, which is due this winter. This will focus primarily on metadata pertaining to objects, events (digital provenance), and relationships among objects. Some rights information may also be included. Purely descriptive metadata and detailed attributes of agents are not addressed, because there are other efforts centered on these. Format-specific technical metadata are also not addressed, because this was outside the expertise of the group.

The metadata will be represented in a data dictionary that will include, for each core element, a definition, rationale, examples and notes, and rules such as data constraints, obligation and repeatability. The group is attempting to be as informative as possible while not presupposing any particular implementation. For example, the data dictionary will differentiate between metadata that applies to three different types of digital objects: representations that can consist of multiple files, files, and bitstreams embedded within files. However, it does not prescribe or presuppose how file-specific information is stored (e.g., with every object, in tables keyed to file format, in a file format registry, etc.).
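The shape of a data dictionary entry as described above can be sketched as a simple record. This is only an illustration of the listed attributes: the class names, the field names, and the sample entry are hypothetical and are not taken from the PREMIS data dictionary, which had not been published when this was written.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Set


class ObjectType(Enum):
    """The three types of digital object the text distinguishes."""
    REPRESENTATION = "representation"
    FILE = "file"
    BITSTREAM = "bitstream"


@dataclass
class DictionaryEntry:
    """One core element with the attributes the paragraph above lists."""
    name: str
    definition: str
    rationale: str
    applies_to: Set[ObjectType]
    obligation: str                 # for example "mandatory" or "optional"
    repeatable: bool
    data_constraint: str = ""
    examples: List[str] = field(default_factory=list)
    notes: str = ""

    def applies(self, object_type: ObjectType) -> bool:
        """True if this element is meant to describe the given object type."""
        return object_type in self.applies_to


# A made-up entry used purely to show the structure.
entry = DictionaryEntry(
    name="exampleSize",
    definition="Size of the object in bytes.",
    rationale="Needed to confirm that the stored object is complete.",
    applies_to={ObjectType.FILE, ObjectType.BITSTREAM},
    obligation="optional",
    repeatable=False,
    data_constraint="non-negative integer",
    examples=["2038937"],
)
print(entry.applies(ObjectType.FILE), entry.applies(ObjectType.REPRESENTATION))
```

Nothing in the record says where or how the values are stored, which mirrors the group's stated aim of describing elements without presupposing an implementation.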

To some extent, however, no data dictionary can represent the richness of the PREMIS discussions. For example, it could be illuminating to know that a particular metadata element was discussed and not considered core. For this reason the group is planning to embed the data dictionary in a full report that includes key discussion points, a data model and a glossary.

For more information see “Implementing Metadata in Digital Preservation Systems: The PREMIS Activity” by Brian Lavoie and the PREMIS website at http://www.oclc.org/research/projects/pmwg/.


Feature Article 3

Editors' Interview with Günter Mühlberger and Claus Gravenhorst of METAe



Claus Gravenhorst is with Content Conversion Specialists (CCS), and Günter Mühlberger is at the University of Innsbruck. They are members of the team that developed METAe.


 

You have been working on the Metadata Engine Project (METAe) for a number of years (see RLG DigiNews Volume 6, Number 3). Your project has been described by some as a model of collaboration involving libraries, other cultural heritage institutions and IT companies working together to integrate library standards into a software development process. Would you describe that collaboration and how it worked?

Günter Mühlberger: The main idea – to bring together universities, IT companies and users (in our case: libraries) – is based on the design principle of the 5th Framework Research Programme of the European Community, the Information Society Technologies programme. The EU is supporting projects to develop prototypes that are close to market release and can be launched after the end of the R&D stage. This is a risky task but provides good opportunities to realise innovative ideas. One of the most important factors for success, apart from the fact that considerable financial input is provided by the funding agency (METAe: EUR 1.5 million), is the model for collaboration: a project will be successful if the interests of partners are fully represented and well-balanced. As with the Metadata Engine project, universities need to be able to publish great papers and release exciting prototypes; companies want to launch new products/services and to enter new markets; and libraries need to see the chance to realise some of their desired functionality, e.g., the use of standards within software tools or semi-automated processes in place of manual actions.

What factors do you think led to this successful partnership/development?

Claus Gravenhorst: The strong interaction with participants of the METAe project and especially their input transformed the idea into a useful application. There have been a lot of meetings with all the members and they evaluated the application from an early stage on. For example, Cornell University Library contributed a lot during the METAe project. Oya Rieger, of the METAe project group, hosted a workshop in Ithaca in September 2003 where some of the major US university libraries participated. The group discussed METS/ALTO as standard output and had a hands-on training session on docWORKS/METAe.

However, I believe another important reason for the success of the METAe project was the strong effort of Günter Mühlberger, coordinator of the METAe project at the University of Innsbruck, and his team, as well as Ralph Tiede and myself, METAe project managers at CCS.

Another key to success was the fact that the vendors ABBYY and CCS aimed at an application ready to be distributed in the market.

How could this process be replicated?

Günter Mühlberger: It is obvious that the future is digital and that all kinds of material within our daily environment will be transferred one way or another into a series of digits. To cope with this situation, there are tangible considerations and efforts to extend the approach of the Metadata Engine project towards other sectors of digitisation as well, such as archival material, video/audio, or even 3D.

You began early with an academic/industry partnership and project, which has turned into a product that institutions are using. Because of the early partnership, you have paid close attention to relevant standards within our community, especially metadata standards, and you have incorporated them into the software. As far as we know, docWORKS utilizes Dublin Core, NISO Z39.87 for technical metadata, and produces METS files. Are there other examples of metadata standards implementation? Do you have plans for more? Could you describe how the software has continued to develop around emerging metadata standards?

Claus Gravenhorst: During the METAe project, we learned that there is no standard to handle word positions and physical layout information (print space, margins, etc.), an essential feature for high performance repositories that are able to highlight elements within documents. Therefore, the ALTO schema has been developed. In the METS file, there are file pointers to the ALTO files that contain the text, other elements (illustrations, etc.), and word positions. We would like ALTO or a similar schema to become a standard as we do not see an alternative right now.
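As an illustration of what word-position data makes possible, the following minimal Python sketch looks up the bounding boxes of a search term so that a viewer could highlight it. It assumes a simplified, namespace-free ALTO-like fragment in which each word is a String element carrying CONTENT, HPOS, VPOS, WIDTH, and HEIGHT attributes. The sample text, the coordinates, and the helper name find_word_boxes are invented for this example and are not taken from docWORKS output.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free ALTO-like fragment (all values are illustrative).
ALTO_SAMPLE = """
<alto>
  <Layout>
    <Page ID="P1" WIDTH="2000" HEIGHT="3000">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Digital" HPOS="100" VPOS="200" WIDTH="180" HEIGHT="40"/>
            <String CONTENT="preservation" HPOS="300" VPOS="200" WIDTH="310" HEIGHT="40"/>
          </TextLine>
          <TextLine>
            <String CONTENT="metadata" HPOS="100" VPOS="260" WIDTH="220" HEIGHT="40"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def find_word_boxes(alto_xml, query):
    """Return (x, y, width, height) for every word equal to query, ignoring case."""
    root = ET.fromstring(alto_xml)
    boxes = []
    for word in root.iter("String"):
        if word.get("CONTENT", "").lower() == query.lower():
            box = tuple(int(word.get(k)) for k in ("HPOS", "VPOS", "WIDTH", "HEIGHT"))
            boxes.append(box)
    return boxes


if __name__ == "__main__":
    # A presentation system could draw these rectangles over the page image.
    print(find_word_boxes(ALTO_SAMPLE, "metadata"))
```

A real ALTO file declares an XML namespace and is reached through a file pointer in the METS file, so production code would need to resolve both.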

Besides Dublin Core, MODS metadata became more interesting, as the MODS schema is more flexible and able to include a lot more structural metadata. An institution itself decides if it wants docWORKS to create either Dublin Core or MODS metadata.

Starting from the METS/ALTO output, which contains the richest variety of data we could imagine, other standards could be created easily using .xsl style sheets. For example, we are able to create TEI files or structured PDF with hidden text using the METS/ALTO output.

An important issue is structural metadata. We just talked to Nancy Hoebelheinrich of Stanford University Library about this and shared the structural map we use in the METS files. We are confident that this will support a METS community discussion about the possibilities for building generally agreed upon structures for certain kinds of digital objects. [Editors' note: For more information about this structural map contact CCS.]

Another essential subject is a metadata standard for newspapers, as we have noticed an emerging interest in the digitization of historic newspapers. Since CCS originally developed its content conversion software for press clipping and press reviews, we are aware that newspapers differ from books, journals, or theses and dissertations, especially in terms of structure. We are aware that IPTC worked hard to develop NewsML. From our point of view, it is important to have a standard that is able to handle the logical and especially the physical structure of newspapers down to the article level.

Is there anything new on the horizon?

Günter Mühlberger: Library materials, such as books, journals or newspapers are relatively well-structured. But archival material, even if we just focus on paper-based collections, is extremely heterogeneous, often fragile and of poor quality — but nevertheless highly interesting, in high demand by users, and often rare or even unique. Obviously tools will be needed to increase automated processes in this field as well. For example, one could think about a software-supported indexing system where an archivist or expert will describe orally in a structured way archived items and a speech recognition software will translate it into an EAD record. At the end the user will be able to search the EAD record, but also to retrieve the audio file with the description of the archivist.

Where can readers get more information?

Claus Gravenhorst: In addition to our website, the website about the METAe project is worth reviewing.

Are there things you would do differently next time?

Günter Mühlberger: When designing the project we focused on an in-house digitisation model and did not realise that many libraries are using service providers for their digitisation projects. Given that oversight, we missed an opportunity to have a service provider company on board. We had a biased perception of “what is really going on,” which led to limited requirements for the software packages. The lesson learned is that representatives of all relevant players need to be on board.

Claus Gravenhorst: For a successful project, covering the complete value chain is important: issues like scanning services, repository and presentation systems, distribution, and maybe even payment systems should be addressed.


Highlighted Web Site

First Monday



http://www.firstmonday.org/

October 2004 marks the 100th issue of First Monday, a peer-reviewed journal on the Internet about the Internet. The journal, part of the Great Cities Initiative of the University of Illinois at Chicago Library, publishes thought-provoking articles in formats such as essays, original research, and case studies, and reaches a diverse and global audience. The journal content cuts a wide swath through topics related to the Internet: its infrastructure and its role as a source of information and of social and economic exchange. Neat search features point to past articles by keyword, by article popularity, and even by random selection. Keyword search results are ranked by a match score. We ran a few searches on topics of interest to RLG DigiNews readers, and here are the results:

Number of results with…

Keywords               > 70% match score    > 90% match score
Digital Preservation   113                  13
Digitization           44                   11
Metadata               5                    2


Spotlight

Technical Metadata Harvesters



According to the results of the Automatic Exposure survey, the practices of cultural heritage institutions are far from uniform when it comes to exposing and obtaining technical metadata from image files. Some institutions are extracting the metadata from image file headers through a fairly manual process. Still others utilize a variety of available software packages to extract some file header information in combination with the manual logging of other information. Finally, some institutions have developed fairly sophisticated methods and promising tools to harvest the technical metadata currently recorded during the capture process.

Although these are positive responses to a difficult problem, access to more sophisticated solutions is not uniformly available. Not all institutions will have the infrastructure in place to allow for local development of such mechanisms or even the modification of another institution's "solution." As well, these varied approaches lack uniformity in the kinds of data being harvested, since each protocol is likely to be built for a specific institution or instance. What is needed is a suite of tools that can be made available and applicable to all types of institutions, affording all cultural heritage institutions the ability to preserve their digital images.

The following tools were identified as in use during the Automatic Exposure survey and survey follow-up. Although most are being locally developed, a commercial software tool is included below because it was identified as in use at more than one cultural heritage institution.

  1. JHOVE: The JSTOR-Harvard Object Validation Environment

    JSTOR and the Harvard University Library collaborated on a project to develop an extensible framework for format validation: JHOVE (pronounced "jove"), the JSTOR/Harvard Object Validation Environment.

    JHOVE provides functions to identify, validate, and characterize digital objects. JHOVE has three main operational modes:

    • Format identification is the process of determining the format to which a digital object conforms; in other words, it answers the question: "I have a digital object; what format is it?"
    • Format validation is the process of determining the level of compliance of a digital object to the specification for its purported format, e.g., "I have an object purportedly of format F; is it?"
    • Format characterization is the process of retrieving the significant properties of an object of format X.

    The third mode (characterization) was most relevant to the Automatic Exposure survey. Essentially, this is the process of extracting pertinent technical characteristics from the digital object itself. In the case of raster image formats—TIFF in particular—a significant subset of the NISO Z39.87 metadata may be made available in this manner.
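    The three modes can be sketched in a few lines of Python. This is an illustration of the concepts only, not JHOVE's implementation or API: the function names, the tiny in-memory sample file, and the choice of four TIFF tags are assumptions made for the example, and the "validation" shown checks only the header and first image file directory (IFD), far less than a real validator would.

```python
import struct

# Tag numbers as defined in the TIFF 6.0 specification.
TAG_NAMES = {256: "ImageWidth", 257: "ImageLength", 258: "BitsPerSample", 259: "Compression"}


def identify(data):
    """Identification: guess the format from the leading magic bytes."""
    if data[:4] in (b"II*\x00", b"MM\x00*"):
        return "TIFF"
    if data[:3] == b"\xff\xd8\xff":
        return "JPEG"
    if data[:6] in (b"GIF87a", b"GIF89a"):
        return "GIF"
    return "unknown"


def _read_ifd(data):
    """Parse the first IFD into {tag: value}; raise ValueError if malformed."""
    if len(data) < 8 or identify(data) != "TIFF":
        raise ValueError("not a TIFF header")
    order = "<" if data[:2] == b"II" else ">"
    (offset,) = struct.unpack(order + "I", data[4:8])
    if offset + 2 > len(data):
        raise ValueError("IFD offset outside file")
    (count,) = struct.unpack(order + "H", data[offset:offset + 2])
    if offset + 2 + 12 * count > len(data):
        raise ValueError("IFD entries truncated")
    entries = {}
    for i in range(count):
        start = offset + 2 + 12 * i
        tag, typ, n = struct.unpack(order + "HHI", data[start:start + 8])
        # Only single inline SHORT (type 3) and LONG (type 4) values are decoded here.
        if n == 1 and typ == 3:
            (entries[tag],) = struct.unpack(order + "H", data[start + 8:start + 10])
        elif n == 1 and typ == 4:
            (entries[tag],) = struct.unpack(order + "I", data[start + 8:start + 12])
    return entries


def validate(data):
    """Validation (very partial): the IFD parses and the size tags are present."""
    try:
        entries = _read_ifd(data)
    except (ValueError, struct.error):
        return False
    return 256 in entries and 257 in entries


def characterize(data):
    """Characterization: report selected technical properties by name."""
    entries = _read_ifd(data)
    props = {TAG_NAMES[t]: v for t, v in entries.items() if t in TAG_NAMES}
    props["ByteOrder"] = "little-endian" if data[:2] == b"II" else "big-endian"
    return props


def make_sample_tiff():
    """Build a little-endian TIFF header plus one IFD (no pixel data) for the demo."""
    header = b"II*\x00" + struct.pack("<I", 8)
    tags = [(256, 4, 640), (257, 4, 480), (258, 3, 8), (259, 3, 1)]
    ifd = struct.pack("<H", len(tags))
    for tag, typ, value in tags:
        ifd += struct.pack("<HHI", tag, typ, 1)
        ifd += struct.pack("<I", value) if typ == 4 else struct.pack("<HH", value, 0)
    return header + ifd + struct.pack("<I", 0)  # 0 = no further IFD


if __name__ == "__main__":
    sample = make_sample_tiff()
    print(identify(sample), validate(sample), characterize(sample))
```

    Values such as image width, image length, bits per sample, and compression are the kind of header information from which a technical metadata record can be populated.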

    The latest release of JHOVE includes modules for arbitrary byte streams; ASCII and UTF-8 encoded text; GIF, JPEG, JPEG 2000, and TIFF images; AIFF and WAVE audio; PDF; and XML. The TIFF module recognizes the various public profiles of TIFF: version 4.0 through 6.0; Baseline 6.0 bi-level, grayscale, palette, RGB, and CMYK; TIFF/IT, TIFF/EP, EXIF, GeoTIFF, etc. (Links to detailed specifications for each module can be found on the JHOVE documentation page.) Options for output include a simple text display (label: value); a "standard" XML schema applicable to all formats, which uses an RDF-like syntax to display complex nested property structures; and, for still image formats, MIX-compliant XML. In short, if cameras and/or scanning devices properly embed preservation metadata within the TIFF file, JHOVE can automatically extract it.

    The current JHOVE release (beta 2) is considered pre-release software and is identified as available for review and testing purposes only.

  2. National Library of New Zealand Metadata Extract Tool

    The National Library of New Zealand has commissioned the development of a Metadata Extract tool. Based on the Library’s preservation metadata data model, the Java/XML tool comprises a generic application and a number of "adapters" developed to extract the data from specific file types. To date adapters have been written for MS Word 2, MS Word 6, TIFF, WAV and BMP. Development of further adapters is planned.

    The tool extracts data from within the header and surrounding directory structure as necessary, then maps that data into the Library's preservation metadata data model format ready for ingest into the Library's metadata repository.
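    The "generic application plus adapters" design described above can be sketched as follows. The actual tool is written in Java/XML; this Python version is only an assumption-laden illustration. The MetadataRecord class, the adapter registry, and the two adapters are hypothetical names invented for the example, and the record is not the Library's preservation metadata data model.

```python
import struct
from dataclasses import dataclass, field


@dataclass
class MetadataRecord:
    """Hypothetical common record that every adapter maps into."""
    filename: str
    file_format: str
    properties: dict = field(default_factory=dict)


ADAPTERS = []  # registry consulted by the generic application


def adapter(cls):
    """Class decorator that registers one instance of an adapter."""
    ADAPTERS.append(cls())
    return cls


@adapter
class BmpAdapter:
    format_name = "BMP"

    def accepts(self, data):
        return data[:2] == b"BM" and len(data) >= 26

    def extract(self, data):
        # Assumes a BITMAPINFOHEADER-style header: signed 32-bit width and
        # height stored little-endian at byte offsets 18 and 22.
        width, height = struct.unpack("<ii", data[18:26])
        return {"width": width, "height": abs(height)}


@adapter
class WavAdapter:
    format_name = "WAV"

    def accepts(self, data):
        return data[:4] == b"RIFF" and data[8:12] == b"WAVE"

    def extract(self, data):
        # Assumes the common layout where the "fmt " chunk follows the RIFF header.
        if data[12:16] != b"fmt " or len(data) < 36:
            return {}
        channels, rate = struct.unpack("<HI", data[22:28])
        return {"channels": channels, "sample_rate": rate}


def extract_metadata(filename, data):
    """Generic application: pick the first adapter that accepts the bytes."""
    for candidate in ADAPTERS:
        if candidate.accepts(data):
            return MetadataRecord(filename, candidate.format_name, candidate.extract(data))
    return MetadataRecord(filename, "unknown")


def sample_bmp():
    """Minimal BMP file header and info header fields (no pixel data)."""
    return b"BM" + struct.pack("<IHHI", 0, 0, 0, 54) + struct.pack("<Iii", 40, 320, 200)


def sample_wav():
    """Minimal RIFF/WAVE header with a 16-byte "fmt " chunk (no audio data)."""
    fmt = struct.pack("<IHHIIHH", 16, 1, 2, 44100, 176400, 4, 16)
    return b"RIFF" + struct.pack("<I", 36) + b"WAVE" + b"fmt " + fmt


if __name__ == "__main__":
    print(extract_metadata("scan.bmp", sample_bmp()))
    print(extract_metadata("talk.wav", sample_wav()))
```

    Supporting a further file type in this arrangement means writing one more adapter class; the generic application itself does not change.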

    The tool is designed for use by the wider digital preservation community and it is hoped that its future development will be informed by that community. The Metadata Extract Tool was recently short-listed for the 2004 Digital Preservation Award. That award was sponsored by the Digital Preservation Coalition and is one of a number of awards offered as part of the Pilgrim Trust Conservation Awards 2004. For more information, contact Steve Knight.

  3. National Library of Australia Digital Collection Manager (DCM)

    The DCM is a database application that supports digitization workflows including upload and download of files to and from the Library's Digital Object Storage system (DOSS), and was developed as part of the Digital Services Project.

    The system records management and technical metadata about digital collection items, including relationships between parts of a work and between various copies of those parts (e.g., originals, masters, and view copies); records process information about the creation of copies; and, for images, extracts relevant technical metadata from file headers. (The system is currently being extended to also manage digital audio collection workflows and objects.)

    A summary of the Digital Services Project, with a brief description of the DCM, and the functional specification, including the data model for the DCM, are available. A paper about the Library's digital services architecture includes some screen shots of the DCM along with listings of some of the supported metadata elements. For further information, please contact Judith Pearce.

  4. OPUS

    OPUS is a commercial product from the Digital Library Systems Group at Image Access. Built upon another Image Access product, BSCAN, OPUS has been several years in the making, and is based in part on prototypes developed in cooperation with university digital library programs.

    OPUS is designed to work with flatbed or planetary scanners to manage imaging workflow, including scanning, image post-processing, derivative creation and metadata creation. OPUS supports multisource metadata input, including technical metadata from image headers and descriptive and structural metadata via OCR and intelligent interpretation of scanned images. Metadata can be output to custom and standard formats including METS XML. A built-in, proprietary scripting language allows for customized solutions.

    OPUS is available in single- and multi-station configurations and supports different scanner options. The product is currently in beta test and a final release date has not yet been established. For further information, contact Image Access.


Calendar of Events





Museum Computer Network 2004
November 10 – 13, 2004
Minneapolis, Minnesota

The theme for the 2004 MCN Conference is “Great Technology for Collections, Confluence & Community.” Online registration is available. Conference programming includes workshops, panel sessions, tours, and a keynote address by Max Anderson.

American Society for Information Science and Technology Annual Meeting
November 12 – 17, 2004
Providence, Rhode Island

Tim Berners-Lee and J.C. Herz will deliver keynote addresses at the ASIS&T 2004 meeting. The theme of this year’s conference is “Managing and Enhancing Information: Cultures and Conflicts.” Session tracks include Disciplinary Issues, Digital Libraries, User Behavior, System Design, and Information Organization.

Institutional Repositories: The Next Stage
November 18 – 19, 2004
Washington, D.C.

This workshop will place special emphasis on strategies for implementing and managing institutional repositories. Break-out sessions include topics related to populating repositories, managing copyright and legal issues, digital preservation, policy making, business modeling, and technical solutions.

International Conference on Developing Digital Institutional Repositories: Experiences and Challenges
December 9 – 10, 2004
Kowloon, Hong Kong

International examples of institutional repository implementation will be highlighted in this 2-day conference co-organized by the California Institute of Technology Libraries and The Hong Kong University of Science and Technology Library.

IS&T/SPIE International Symposium Electronic Imaging 2005
January 16 – 20, 2005
San Jose, California

Conference tracks at this symposium of the Society for Imaging Science and Technology and the International Society for Optical Engineering include: 3D Imaging, Interaction, and Measurement; Imaging, Visualization, and Perception; Image Processing; Digital Imaging Sensors and Applications; Multimedia Processing and Applications; and Image and Video Communications and Processing.

2005 NFAIS Annual Conference
February 27 – March 1, 2005
Philadelphia, Pennsylvania

"Whose Mind is it Anyway? Identifying and Meeting Diverse User Needs in the Ongoing Battle for Mindshare" will be the National Federation of Abstracting and Information Services (NFAIS) conference theme in 2005. The conference will “focus on the differences and commonalities in the search and retrieval behavior of information professionals/librarians and desktop searchers, and the resultant implications for information providers and librarians who must provide products and services that will meet the needs and expectations of these diverse constituencies.”

ACH/ALLC 2005
June 15 – 19, 2005
British Columbia, Canada

The 17th Joint Conference of the Association for Computers and the Humanities (ACH) and the Association for Literary and Linguistic Computing (ALLC) is inviting proposals for presentations on topics related to computing and information technologies in humanities subjects.


Announcements





Madison Digital Image Database (MDID 2) software updated
James Madison University announced an updated version of the free, open source Madison Digital Image Database (MDID 2) software. MDID was developed to extend digital library collections into the classroom and other teaching venues. It is an ASP.NET web application using IIS on Windows 2000 or 2003 server and supports MySQL and Microsoft SQL Server 2000 databases. Visit the website to view more information, to try an online demo, and to visit a user group message forum.

Recordkeeping Magazine
The National Archives, UK has combined several of its newsletter publications into a new quarterly magazine called Recordkeeping. Available online as a PDF file, it includes updates about its digital preservation activities and other news items, case studies, and standards and guidance features.

What’s a Wiki?
Brian Lamb pens an extensive review of the interactive Web space format known as “Wikis” in the September/October 2004 issue of Educause.

Fragile sound recording of JFK assassination to be preserved
The US National Archives and Lawrence Berkeley National Laboratory are working to digitize and preserve fragile sound recordings of the JFK assassination. This story was published widely in the popular press, highlighting the need for and benefits of digital preservation efforts.

The State of Audio Collections in Academic Libraries
CLIR (Council on Library and Information Resources) has published the results of its 2003 study of audio recordings in academic libraries. The report “Survey of the State of Audio Collections in Academic Libraries” is available online (in both HTML and PDF versions) and in print.

Electronic Records Archives Design Contract Competition
The US National Archives has identified the two companies that will participate in a one-year design competition to build the Electronic Records Archives. The companies, Lockheed Martin (Transportation and Security Solutions Division) and the Harris Corporation (Government Communications Systems Division), have been awarded contracts valued at $20.1 million to develop “a technological solution to the challenge of preserving electronic information across space and time.”

BioMed Central launches repository service
In time to respond to recent calls in the UK and US for making published results of publicly funded research more accessible, BioMed Central has launched a new repository service. The Open Repository promises cost-effective (pricing information is available on the website) and flexible service plans to help institutions build, populate, and maintain repositories.

Guidelines for Digitizing Archival Materials
The US National Archives and Records Administration has posted a major revision of their "Technical Guidelines for Digitizing Archival Materials for Electronic Access: Creation of Production Master Files - Raster Images," June 2004. This is a revision of the 1998 "NARA Guidelines for Digitizing Archival Materials for Electronic Access."

International Internet Preservation Consortium report on web crawling
The IIPC has posted a report, Test Bed Taxonomy for Crawler, that outlines the challenges of automated archiving of web-based documents using web crawlers. The 15-page document covers issues such as non-typical URIs, cookies, forms, non-HTML content and links, and robots.txt exclusions.


Publishing Information





RLG DigiNews (ISSN 1093-5371) is a Web-based newsletter conceived by the RLG preservation community and developed to serve a broad readership around the world. It is produced by staff in the Department of Research, Cornell University Library, in consultation with RLG and is published six times a year at www.rlg.org.

Materials in RLG DigiNews are subject to copyright and other proprietary rights. Permission is hereby given to use material found here for research purposes or private study. When citing RLG DigiNews, include the article title and author referenced plus "RLG DigiNews." Any uses other than for research or private study require written permission from RLG and/or the author of the article. To receive this, and prior to using RLG DigiNews contents in any presentations or materials you share with others, please contact Jennifer Hartzell (jlh@notes.rlg.org), RLG Corporate Communications.

Please send comments and questions about this or other issues to the RLG DigiNews editors.

Co-Editors: Anne R. Kenney and Nancy Y. McGovern; Associate Editor: Robin Dale (RLG); Technical Researcher: Richard Entlich; Contributor: Ellie Buckley; Copy Editor: Martha Crowe; Production: Jenn Demaree, Carla DeMello.


All links in this issue were confirmed accurate as of October 15, 2004.




 
© 2006 RLG
 