RLG DigiNews
   
  June 15, 2002, Volume 6, Number 3
ISSN 1093-5371

Table of Contents


Editors' Interview
The Internet Archive, an Interview with Brewster Kahle

Feature Article 1
Automated Digitisation of Printed Material for Everyone: The METADATA ENGINE Project, by Günter Mühlberger

Feature Article 2
Researching Long Term Digital Preservation Approaches in the Digital Preservation Testbed (Dutch Testbed Digitale Bewaring), by Maureen Potter

Highlighted Web Site
Chilling Effects: Monitoring the Legal Climate for Internet Activity

FAQ
Where Are They Now? Digitizing Microfilmed Newspapers, by Richard Entlich

Calendar of Events

Announcements

RLG News
RLG and OCLC Release Two Final Reports



Editors' Interview

The Internet Archive

Brewster Kahle
The Internet Archive
brewster@archive.org

Editors' Note
The editors interviewed Brewster Kahle by phone on May 15, 2002. Here is an edited version of the interview. Brewster Kahle is the founder and director of the Internet Archive; co-founder of Alexa Internet, an Internet-focused company that concentrates on Web navigation tools and techniques; and inventor and founder of Wide Area Information Servers, Inc. The Internet Archive launched the Wayback Machine, a Web site that provides an interface to the Internet Archive collections, in October 2001.

General Operations

You launched the Wayback Machine a little more than six months ago. How has the actual response compared to your expectations?

We have gotten more usage than I thought we would. We get 20,000 different users each day. There are now parts of the Web that link to the past. What does it mean to have the past as part of the present? The Wayback Machine injects the past into the present. The Web is a self-documenting, self-cataloging machine. Unlike books, with the Web, the catalog and content are part of the same thing, but not completely.

We were first on the Yahoo Internet Life list of top 100 Web sites for 2001. Der Spiegel named us "Hit of the Year," and we received the 3rd Digital Archive Award in Kyoto, Japan. If you look at use by country, the greatest number of hits comes from Japan. We've hit a nerve. People care about their own history. They're psyched about it. On the Web, anyone can be a publisher; now there is a library for their work.

Also, the site has been used in more situations than expected. It has been interesting all around. For example, a group of master's students at the Berkeley School of Information Management and Systems this spring presented the results of a series of interviews on why people use the Wayback Machine. They discovered that looking up personal Web pages is a major use. "Did I make it into history?" people wonder. After that, organizations look for their sites. I thought people would poke around, have some fun. It provides a reference service.

What kind of organizational infrastructure does it take to run the Internet Archive and the Wayback Machine?

There are currently seven full-time employees and 14 interns at the Internet Archive. It would be an understatement to say that we did everything with that number. We contract with many organizations. Alexa Internet has thirty people, but not all of them are working on the Internet Archive. Compaq actively supports us. It's a ten million dollar per year operation to do the tasks we're currently doing.

Content and Capture

Do the rate of growth and sheer size of the Internet Archive present special technological problems?

Oh, yeah. The dataset we have is a queryable Web collection, which makes it one of the largest databases ever made. There are up to 200 queries per second. That combination, we believe, is unprecedented. Our claim to fame is our level of frugality. We don't have a lot of money.

We need to be willing to use innovative tools and techniques, to try the very newest ideas. The classic technology approach would be to buy a database and implement RAID (Redundant Array of Independent Disks). For 1 terabyte (TB) you'd be looking at $400,000 for EMC disk storage, $150,000 for a Sun box, and $100,000 for an Oracle database. Effectively, that would be almost a million dollars for 1 TB. Like Google and Hotmail, we use a large number of PCs in parallel. Our technology has to be state-of-the-art to keep up. We collect 10 TB per month. Our approach has to be cost-effective and flexible. For instance, we moved off tape to removable hard drives.

You have a number of special collections on your site, such as Election 1996. Do you have an ad hoc approach as subjects of interest surface or do you have a collection plan? The Internet Archive FAQ encourages donations of collections. Will your special collections program expand?

All our special collections have been done in partnerships. Someone else helps with the curatorship, selection and quality assessment. The 1996 Election collection became a kiosk in the Smithsonian. They thought it possible that the Web might be the next "bumper sticker" in campaigning. The Election collection grew into a big project with the Library of Congress for 2000. There were lots of questions—"could you?" "should you?" "are you allowed to do this?"

We see special collections growing through partnerships into national collections. It is inconceivable for countries not to record their digital heritage. A lot of history is born digital. This should not be like early television where there is not a record.

Your policy is to collect Web pages that are publicly available. Do you have any estimates of how much of the Web is inaccessible to your crawlers?

The Web is effectively infinite. If you start crawling, you'd never end. How big is our collection? Enormous—really big. We want to increase access—researchers' access—and have them help guide the collection. We have proven you can do something and it can be useful. We already have some of the deep Web—where the crawl can generate each page for capture. What we want now is a complete collection of important things. We want more critiques of collections from researchers, historians and scholars—like "it would have been great if…" These comments will inform the collections.

The Internet Archive FAQ indicates that the addition of a robot exclusion to a Web site will "lead to the removal of the pages from the Wayback Machine." What amount of content has been removed as a result of this policy? How is the removal documented?

We check for robot exclusions and retroactively exclude content for those sites. There are some classes of people that have opted out. Many newspapers do not want to be included. Individuals are all over the map. Photographers feel disadvantaged; they don't feel that being part of it will help them. This represents a very small percentage. We do not document the removal of content, but we collect robot exclusions. At first, we didn't collect them. Ben Edelman, a researcher at Harvard Law School, is looking at the relationship between domain name ownership and robot exclusions.
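As a rough illustration of what checking a robot exclusion involves, the sketch below uses Python's standard `urllib.robotparser` module to test whether a crawler may fetch a URL under a site's robots.txt rules. The example rules, the `is_excluded` helper, and the crawler name used here are assumptions for illustration only; this is not the Internet Archive's own code.

```python
from urllib.robotparser import RobotFileParser


def is_excluded(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt text forbids user_agent from fetching url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


# Hypothetical robots.txt that shuts out one crawler entirely
# and keeps all other crawlers out of /private/ only.
EXAMPLE_ROBOTS = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""

blocked = is_excluded(EXAMPLE_ROBOTS, "ia_archiver", "http://example.org/index.html")
allowed = not is_excluded(EXAMPLE_ROBOTS, "otherbot", "http://example.org/index.html")
```

A retroactive exclusion, as described in the answer above, would apply the same check to pages that are already in the collection before serving them.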

Web Crawlers and Access

As of October 2001, the Internet Archive offered access to roughly 10 billion pages. How long does it take to crawl the entire Web? Do you give any preference to crawling particular portions of the Internet more often or more thoroughly? If so, how?

There are more pages now, but we do not have a more recent number. The Internet Archive receives most of its Web pages from donations from Alexa Internet. They do a new crawl every two months. We crawl for special collections weekly or daily. If we discover a new page, we crawl it within 24 hours. We do a complete sweep every two months.

 

A powerful Web crawler must be essential to the development of the Internet Archive. Could you describe how your Web crawler support has changed in response to the evolution of this technology since you began the Internet Archive in 1996?

The crawler technology is constantly being rewritten. Every 12-18 months Alexa throws out the existing crawler and starts over from scratch. Revisions to the crawler reflect changes in the structure of the net, its size, the underlying technologies, etc. What we have now is the best available crawler.

In searching for specific pages/sites on the Wayback Machine, we've noticed varying gaps in content, sometimes within a page, e.g., missing images, sometimes within a site, e.g., missing pages, and sometimes over time, e.g., long gaps between capture dates. What strategies have you developed for crawling to maximize capture and to validate the content that is captured, or do you use a passive capture approach?

All of the laundry from the past is shown to everybody in this collection! Some gaps were the Internet Archive's fault and some were not. Some missing images are due to robot exclusions. The technology has changed. From 1996-1998, the crawler crawled a full Web site or as many pages as it wanted in one day, so there'd be a clean copy. Other times, it might follow up later—many days later. Different crawl philosophies were used. The 1999 crawls do not contain a lot of images because we did not have enough bandwidth for text plus images. There were months when there was no crawling at all while the crawler was being rewritten.

Your policy is to provide "free access to the Internet collections to researchers, historians, and scholars through an account on a Unix server." Do you allow Web crawlers to access the Internet Archive? Does the Wayback Machine provide access to all of the Internet Archive content?

Our terms of service are that users respect other people's privacy and copyright. You could interpret that as "don't copy from the Wayback Machine or the Unix archives collections without talking to the Internet Archive." We have robot exclusions on the Internet Archive collections. We do make case-by-case exceptions.

Does the Internet Archive support the Open Archives Metadata Harvesting Protocol?

We haven't supported OAI (Open Archives Initiative) yet, but we want to. We work with users. We want to be as standards-based as possible as long as people would use it. We are willing to be an OAI data provider or service provider.

Digital Archives: Technical Issues

We have a series of preservation-related questions for you. First, has the Open Archival Information System (OAIS) reference model influenced the development of the Internet Archive? If not, is the Internet Archive based upon a digital archive model?

I guess not. Should we? We would like to comply with prevailing standards. We are currently reviewing the RLG/OCLC Trusted Digital Repositories white paper.

Your FAQ on storage and preservation indicates that you will migrate your storage media but use emulation for your file formats. Did you consider file migration? Are you exploring other options? Are you working on any research in these areas?

Once archived, we never change a page. The Web wasn't constructed to be archived. It's so interconnected. A book exists outside of time. Archiving Web sites is like putting together a bomb after it has exploded. We do as good a job as we know how. If anyone knows how to improve it, please let us know or help us change it. We can't force people to archive in a certain format. The Wayback Machine is the best way we know to look at the results.

File format changes haven't become a problem for us yet. Most old HTML is still supported. Sometimes we copy files forward. We store things in the exact way they were gotten. We are very careful at ingest to record all metadata. When using the Wayback Machine the links on a page are changed on the fly to point into the past. You need to watch the URL when you're surfing the Wayback Machine to answer the question—"When was this page archived?" We tried putting a frame around the page, but that is technically hard to do. Users, we are afraid, forget to look at the URL, but we don't know a better way.
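To make the on-the-fly link rewriting concrete, here is a minimal sketch. It assumes the archive address layout visible in Wayback Machine URLs (a prefix, a 14-digit capture timestamp, then the original address). The function names, the regular expression, and the example timestamp are hypothetical, and only double-quoted `href` attributes are handled.

```python
import re
from urllib.parse import urljoin

ARCHIVE_PREFIX = "http://web.archive.org/web/"  # address prefix seen in archive URLs


def rewrite_links(html, page_url, timestamp):
    """Point every href in html at the archived copy for the given timestamp."""

    def replace(match):
        # Resolve relative links against the original page address first.
        absolute = urljoin(page_url, match.group(2))
        return f'{match.group(1)}{ARCHIVE_PREFIX}{timestamp}/{absolute}"'

    # Simplification: only double-quoted href attributes are rewritten.
    return re.sub(r'(href=")([^"]*)"', replace, html)


def capture_date(archive_url):
    """Read the capture timestamp (YYYYMMDDhhmmss) out of an archive URL."""
    match = re.search(r"/web/(\d{14})/", archive_url)
    return match.group(1) if match else ""


# Made-up page and timestamp, for illustration only.
page = '<a href="news.html">What\'s New</a>'
rewritten = rewrite_links(page, "http://www.example.org/index.html", "20000815000000")
```

The `capture_date` helper mirrors the advice in the answer above: the archived date is carried in the URL, so that is where a reader (or a tool) has to look for it.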


Figure 1. JFK Library Web page captured August 15, 2000 points readers to news releases from 2002 via the "What's New" link because the frame is from the earlier date and the news page is from the current date. The frame is using referential linking to the current page.

Your preservation FAQ references the ARC format proposed by Alexa Internet as an Internet object archiving standard. Is Alexa trying to make this a formal standard, and if so, through which standards body?

We haven't approached a standards body at this time. It's a lot of work. I was in charge of an IETF group. I found that a couple of people working together works best. Put something together and throw it out there. If people pick it up, it's worth going through the headaches. Some others are doing Web archiving, but with their own tools. We have started active outreach to other projects to see if we can share tools, formats, and experiences.

We understand that you are trying to maintain multiple copies of the entire set of collections. Are these mirror sites? Does the amount of content present special technical issues for establishing redundant storage?

Mirrors? Yes. We donated a complete copy of the collection to the new Library of Alexandria. I spent April 2002 in Alexandria, Egypt. It was a kick to see the machine it's on—it's front and center when you walk in. The best method for preservation is replication. Put copies in as many different contexts as possible—one in RAID, bury a copy. Put a copy in someone else's hands. They will care for it differently, and take care of some bugs. Usually books are destroyed because the new regime is not interested in the old. We learned a lesson from Alexandria; we want to place duplicates on other continents. Redundancy is antithetical to current culture. Libraries often strive to create unique collections that are proprietary. We are less proud of collected materials if there are other copies. We need to show positive examples to change the way things work.

Digital Archives: Financial Issues

Your Web site states that the Internet Archive was founded to "build an 'Internet library' with the purpose of offering permanent access" to the content. Have you established a sustainable funding model for the Internet Archive? Is there a business continuity plan that will ensure ongoing access to the collections?

Yes, we have a sustainable funding model. Everything that has been donated is safe. There is enough money to make sure that it's safe forever. Considering Moore's law—what we have is sustainable for the present, but access is more expensive. We need to blend with the research community. We need to put our hand out for support to further the work on access.

We noticed the monetary donation button on the Internet Archive site via Amazon.com. Is this similar to NPR's pledge drives? What kind of response have you gotten? Do you use those funds for specific purposes?

There has been almost no response to pledge unless you badger. The real money has come from private foundations, government support, and in-kind donations from corporations—Zoe Baird and the Markle Foundation, NSF, and LC. We need to expand the list by making this part of people's lives, to open new doors.

Legal Issues

Can you discuss your recent brief for the Supreme Court case, Eldred v. Ashcroft? Do you anticipate the Internet Archive becoming more active in such cases?

We are not bringing the cases. People are using the Internet Archive as an example of how they see the future. Peter Lyman suggested in his paper "Why Archive the Web?" that the Internet is the information resource of first resort for millions of readers. Some say the good stuff is not on the Net. I have a sinking feeling that if the young are learning from the Web, and the Web is not offering the best, that we are shortchanging our youth. This is criminal. It makes no sense. The expectation that materials for learning should be on the Web brings urgency to the issue. We—the establishment (face it, we are)—are screwing up. We do not need more meetings on metadata. We need to get our cultural heritage easily accessible on the Web. Newspapers are up there. Academic journals are becoming available. Books, music, videos, and TV are trying but they are suing their way into the next century—or the last one. That all this stuff ends up in the courts is dumb; that approach doesn't make sense. Raj Reddy promotes "universal access to human knowledge." Some for free and some for pay. If we take as our premise that all information should be at our fingertips, how do we get there? The public library system is a $25 billion industry. $5-6 billion goes to publishers. If students aren't consulting with the Library of Congress daily then let's get on with it. If we wait for 20 years, we could shortchange a human generation. Kids could grow up without the best works of the 21st century. Our libraries contain large amounts of pre-20th century materials which could all be online now because there are no rights issues. Raj Reddy says we are moving too slowly. We are trying to do the million books project. A million books for free, great. India and China are in, great. The US, not really—Cornell, Math—these are dribs and drabs. Kids are demanding a different way. RLG and the research libraries—these are rich institutions and they can be doing much more. The technology is not hard. 
Raj Reddy says it takes 2-4 hours to scan a book. Not the most interesting 2-4 hours, but then that book can be offered to the world. Gutenberg showed how it works. We need to do our job.

Future Plans

How do you measure success for the Internet Archive? What are your plans? Do you have specific goals that you would like to achieve?

We want to grow our collections, but grow them in ways that they are useful to traditional library users—researchers, scholars, and the underserved. On the Web we can put tools and technology on top of collections—a search engine to answer harder questions. We can bring tools and information together in new ways that weren't possible before the Web. We think we have an archive, we want to build a digital library. We need partnerships to do this. We have collections but not a lot of finding aids. The top-level best thing for the community is universal access to human knowledge. It is within our grasp. We need to coordinate our efforts and just do it. I leap out of bed in the morning fueled by this possibility. We have the ability to touch people—like poor kids in Yemen. When the goal sinks in, every day is surprising. It straightens the road, it's not curvier. We can leave a legacy which will make our grandkids proud of us. This is not fanciful. We have technology for storing, for accessing. We have the political will to live in an open society. Libraries are now stepping up to the challenge.



Automated Digitisation of Printed Material for Everyone: The METADATA ENGINE Project

Günter Mühlberger (1)
University Library Innsbruck
guenter.muehlberger@uibk.ac.at


The European Union R&D project METADATA ENGINE focuses on the digitisation of printed material such as books and journals. The project comprises 14 partners from 7 European countries and the US. Some of the libraries among our partners play leading roles in the field of digitisation, including the National Library of France and Cornell University Library (2). The project is co-ordinated by the University of Innsbruck. It started in September 2000, and will be finished by spring 2003. The main objectives of the project are to:
  • make digitisation more effective in terms of costs and resources needed
  • automate the whole conversion workflow and especially metadata capture by applying layout and document analysis algorithms
  • provide a standardized output that is compliant with emerging standards
  • increase the added value of digitally reformatted material
These objectives will be realised by developing a comprehensive, extensible, and easy-to-use software package, the so-called METAe engine. The software will be commercially available after the end of the project and distributed by the German software house CCS GmbH. The following paper describes the main features of the software, gives some explanations regarding its technological background, and outlines some of the expected results and benefits.

Why METADATA ENGINE?

The basic approach of the project is to automatically create and record as much administrative, descriptive and structural metadata (3) as possible during the conversion process. Using the METAe engine, the routine workflow will result in a full description of the digitized document. The following table gives an illustration of the metadata gathered during digitisation:

 

| | Available data | Descriptive metadata | Administrative metadata | Structural metadata (logical) | Structural metadata (physical) |
|---|---|---|---|---|---|
| Formats | e.g., MARC records, TIFF images | METS, Dublin Core | METS, DIG35 (partly) | METS structural map | ALTO (Analyzed Layout and Text Object) |
| METAe engine | Imports the whole record or just a sub-set of data. Provides a linking from METS to MARC | Creates descriptive records for articles, pictures… | Records metadata | Suggests labels for logical elements and structures | Provides suggestions for physical structure |
| User mode | Fully automated | Semi-automated with correction recommended | Fully automated for technical metadata, semi-automated for other administrative data | Fully automated with correction recommended | Fully automated with correction only for special cases |

Table 1. Metadata Creation During the Conversion Process


The conversion process begins with page images (e.g., TIFF files or other formats) that are scanned with the METAe engine or that are already available on a file system. At the same time, existing descriptive metadata from MARC records can be imported and integrated directly into the workflow.

The first step in metadata creation is to record administrative information such as the type of scanner, the file format, the date of acquisition, the person who has carried out the scanning, etc.

The next step is to create structural and descriptive metadata for the content of the converted document.

Structural metadata are recorded from a physical as well as from a logical point of view (4). From the physical point of view, we are concerned with such questions as: How are the bitmaps distributed over a given page image? Do they belong to textual or graphical zones? What coordinates do these elements have? At a more detailed level, we can ask: Where are the zones, lines, or even words, within the page image? What font size does a word have? Which alphabet (Latin, Greek, black letter, etc) is used? Once the conversion process is completed, this physical view will, in principle, allow a 1:1 reconstruction of a given document.
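The physical view described above can be pictured as a small hierarchy of zones, lines, and words, each carrying coordinates. The sketch below is only an illustration: the class and field names are hypothetical and do not reproduce the ALTO file structure or any METAe data model.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Word:
    text: str
    x: int                  # left edge in pixels
    y: int                  # top edge in pixels
    width: int
    height: int
    font_size: float = 0.0
    script: str = "Latin"   # e.g. "Latin", "Greek", "black letter"


@dataclass
class Line:
    words: List[Word] = field(default_factory=list)

    def bounding_box(self) -> Tuple[int, int, int, int]:
        """Smallest (x, y, width, height) rectangle enclosing all words."""
        left = min(w.x for w in self.words)
        top = min(w.y for w in self.words)
        right = max(w.x + w.width for w in self.words)
        bottom = max(w.y + w.height for w in self.words)
        return (left, top, right - left, bottom - top)


@dataclass
class Zone:
    kind: str                                   # "text" or "graphic"
    lines: List[Line] = field(default_factory=list)


# Made-up coordinates for a two-word line inside one textual zone.
line = Line([Word("Digital", 100, 50, 80, 20, 12.0), Word("library", 190, 52, 75, 18, 12.0)])
zone = Zone("text", [line])
```

Because every word keeps its own position and size, a structure of this kind holds what is needed to place the text back on the page image, which is the sense of the 1:1 reconstruction mentioned above.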

However, the structural metadata connected with the logical or intellectual dimension of a document is much more important than the physical description. The body text, with paragraphs, footnotes, margin notes, appendices and the like, forms the intellectual content of a book and needs to be recorded in detail. One of the main features of the METAe engine is its ability to create structural metadata automatically, based on a systematic analysis of the document and its layout.

An article in a journal may contain text, photographs, and drawings that are all part of the structural map of the document, and yet these components may also be valuable intellectual items in their own right and therefore should be described separately as well as in the context of the larger document.

For each document, all metadata elements, as well as their relation to each other, are recorded in the internal database of the METAe engine. This database produces a generic XML output file that can be configured to serve the particular needs of a digital library management system. Even so, the project team decided to support at least one preferred output schema and chose the METS schema (5).

The standard output file of the METAe engine is therefore designed in the following way: The METS file is the surrounding bracket within which descriptive data is either referenced, e.g., to an existing MARC record, or labelled according to Dublin Core. Administrative metadata follows, in some respects, the specifications set up by DIG35 (6). The structural metadata, on the logical level, is formed according to the guidelines provided in the METS schema. The metadata describing the physical dimensions of the document is stored separately, using the so-called ALTO (Analysed Layout and Text Object) file, the structure of which has been developed by the project team (7).
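As a hedged sketch of the "surrounding bracket" idea, the following code builds a skeletal METS document with one descriptive section that references an external MARC record and one logical structural map. The element and attribute names (`dmdSec`, `mdRef`, `structMap`, `div`) follow the METS schema published by the Library of Congress. The function name, the IDs, and the example URL are assumptions, the administrative section and the ALTO reference are omitted, and the output is not validated against the schema or against actual METAe output.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def mets_tag(name: str) -> str:
    """Qualify an element name with the METS namespace."""
    return f"{{{METS_NS}}}{name}"


def build_mets(marc_record_url: str, chapter_titles: list) -> ET.Element:
    """Build a minimal METS tree: a MARC reference plus a logical structure."""
    root = ET.Element(mets_tag("mets"))

    # Descriptive metadata: point to an existing MARC record instead of copying it.
    dmd = ET.SubElement(root, mets_tag("dmdSec"), {"ID": "DMD1"})
    ET.SubElement(dmd, mets_tag("mdRef"), {
        "LOCTYPE": "URL",
        "MDTYPE": "MARC",
        f"{{{XLINK_NS}}}href": marc_record_url,
    })

    # Logical structure: one book division with a child division per chapter.
    struct_map = ET.SubElement(root, mets_tag("structMap"), {"TYPE": "logical"})
    book = ET.SubElement(struct_map, mets_tag("div"), {"TYPE": "book", "DMDID": "DMD1"})
    for order, title in enumerate(chapter_titles, start=1):
        ET.SubElement(book, mets_tag("div"),
                      {"TYPE": "chapter", "ORDER": str(order), "LABEL": title})
    return root


# Hypothetical record URL and chapter titles.
mets = build_mets("http://example.org/marc/12345", ["Preface", "Chapter 1"])
xml_text = ET.tostring(mets, encoding="unicode")
```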

Architecture

Workflow Component

The METAe engine consists of a workflow component and a database of rules. The workflow component and its related interfaces enable the user to carry out the whole digitisation process, including scanning, image-processing, physical and logical analysis, quality control, configuration, and administration. A graphical user interface (GUI) allows the user to verify and to correct all automatically processed metadata. The workflow can be configured in a flexible way, i.e., by doing one procedure after the other, or by checking selected pages and elements at crucial steps of the process. The engine will run on the Windows platform either on a single workstation or in a network environment.


The main user interface for interacting with the system consists of two parts, the first of which is a frame for displaying the physical and logical structure of a document, as indicated in Figure 1 below. The document can be browsed on the level of hierarchies, or on single elements. In the second frame, the physical representation of the logical element is presented. In the case of a chapter, all pages relating to the chapter are displayed. In the case of a picture, the related page is shown. In order to provide a better context, the elements are highlighted with different colors, e.g., yellow for running text, green for a picture.


Figure 1. Thumbnail of a screenshot of the METAe GUI


Database of Rules

The second module is the core of the METAe engine. It is not visible to the user and consists of a database of rules designed to automate the digitisation process. In order to create effective rules, a "grammar of books and journals" has been set up. Our approach (8) is based on the assumption that documents are semiotic systems with a special syntax that can be modelled by applying rules derived from the layout of books. Even though the METADATA ENGINE project focuses on books and journals, it is obvious that the database of rules can be extended to other documents such as flyers, newspapers, manuscripts, magazines, posters, handbooks, encyclopaedias, or finding aids. At present, all basic rules are being implemented in the database. Our first results are highly encouraging. Data about the effectiveness and performance of the METAe engine will be available once the validation phase of the project has been completed.

Features

Cropping and Splitting of Pages

Although it might be a good decision to cut bound documents and to scan the leaves of a book one by one on a flatbed scanner, not all libraries will choose this option. An alternative might be found in overhead scanners or in a completely automated scanning machine (9). These scanners will provide double page images that look more or less like the following:

Image showing the cropping of single pages after scanning with a planetary book scanner.

Figure 2. Cropping of Single Pages by Utilizing the Print Space of Books


These double page images need to be split, the single pages cropped, and, optionally, they may be adjusted and deskewed. In the METAe engine the whole process will run automatically. Books are printed according to a clearly defined printing space. Only a limited number of special elements may appear in the surrounding margins. Therefore, the engine will first determine the coordinates of the print space used in the given document and then apply this zone to the actual page image. Next it will add a virtual margin around the printing space and cut the pages. If a document contains supplements that do not conform to the default print space, such as maps, tables, graphs, and pictures, this variation will be detected automatically.
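The split-and-crop sequence described above can be sketched on a toy bitmap. This is a simplified illustration under stated assumptions, not the METAe algorithm: the page is a list of rows with 1 for ink and 0 for background, the gutter is assumed to sit exactly in the middle of the scan, the print space is taken as the smallest box containing all ink, and deskewing and the detection of nonconforming supplements are left out. All function names are hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]          # rows of pixels, 1 = ink, 0 = background
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), end-exclusive


def split_double_page(image: Image) -> Tuple[Image, Image]:
    """Cut a double-page scan at the middle column (assumes a centered gutter)."""
    mid = len(image[0]) // 2
    return [row[:mid] for row in image], [row[mid:] for row in image]


def print_space(pages: List[Image]) -> Box:
    """Smallest box containing the ink of every page given (pages must contain ink)."""
    rows = [r for page in pages for r, row in enumerate(page) if any(row)]
    cols = [c for page in pages for row in page for c, px in enumerate(row) if px]
    return (min(rows), min(cols), max(rows) + 1, max(cols) + 1)


def crop(page: Image, box: Box, margin: int) -> Image:
    """Crop a page to the box plus a virtual margin, clamped to the page edges."""
    top, left, bottom, right = box
    top, left = max(0, top - margin), max(0, left - margin)
    bottom = min(len(page), bottom + margin)
    right = min(len(page[0]), right + margin)
    return [row[left:right] for row in page[top:bottom]]


# Toy double-page scan: 6 rows by 12 columns with a small block of ink on each page.
scan = [[0] * 12 for _ in range(6)]
for r in (2, 3):
    for c in (2, 3, 8, 9):
        scan[r][c] = 1

left, right = split_double_page(scan)
left_cropped = crop(left, print_space([left]), margin=1)
```

Passing all left-hand (or all right-hand) pages of a book to `print_space` yields one box for the whole document, which matches the idea of determining the print space once and then applying it to each page image.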

Dynamic Binarization

The most important step in the digitisation process is to create the best image file possible, since it will constitute the basis for all further steps. Both the METAe layout analysis and the OCR engine rely on good image quality. In accordance with the guidelines recommended by the DLF (10), there are circumstances in which 300-400 dpi grey-scale (8 bit) scanning is preferable to 600 dpi b/w. Grey-scale scanning may provide better results in the OCR and layout analysis process in the METAe engine, because a dynamic binarization feature handles different parts of the page image at different thresholds. This improves the recognition rate for OCR considerably. We also have to take into account the fact that, from the 1880s onwards, many documents contain halftones that cannot be digitized in a satisfying way unless grey-scale or color-mode scanning is conducted (11). As the example below shows, the big advantage of the METAe engine is that the whole book can be scanned in grey-scale in one single pass. The detection of the images and the dynamic binarization of the textual zones will be done afterwards automatically by the METAe engine.

Image showing process of binarization: 1. scan in grayscale or color mode; 2. extract pictures or graphs; 3. keep text areas as black and white and images as gray or color.

Figure 3. Dynamic Binarization and Detection of Graphical Zones
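The article does not specify how the METAe engine chooses its thresholds. As a stand-in, the sketch below uses local mean thresholding, a common adaptive technique, to show why handling different parts of a page at different thresholds helps. The function names, the window radius, and the offset are assumptions chosen for this toy example.

```python
from typing import List

Gray = List[List[int]]  # rows of grey values, 0 = black ... 255 = white


def binarize_global(image: Gray, threshold: int = 128) -> List[List[int]]:
    """Mark a pixel as ink (1) when it is darker than one fixed threshold."""
    return [[1 if px < threshold else 0 for px in row] for row in image]


def binarize_dynamic(image: Gray, radius: int = 1, offset: int = 10) -> List[List[int]]:
    """Mark a pixel as ink (1) when it is darker than its local mean minus offset."""
    height, width = len(image), len(image[0])
    result = []
    for y in range(height):
        out_row = []
        for x in range(width):
            # Collect the neighbourhood, clipped at the image borders.
            window = [
                image[ny][nx]
                for ny in range(max(0, y - radius), min(height, y + radius + 1))
                for nx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            local_mean = sum(window) / len(window)
            out_row.append(1 if image[y][x] < local_mean - offset else 0)
        result.append(out_row)
    return result


# Toy page: the background darkens from left to right (e.g. a stained page),
# with one dark stroke in the bright part and one in the dark part.
row = [220, 200, 110, 180, 160, 140, 50, 120, 100]
page = [row[:] for _ in range(3)]

fixed = binarize_global(page)
local = binarize_dynamic(page)
```

On this toy page the fixed threshold turns the darkened background at the right edge into ink, while the local threshold keeps only the two strokes. Sharp jumps in background brightness can still mislead a local mean, so a production method would need more care than this sketch.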


Matching of Page Numbers and Image Files

In order to support the basic features of a digital library Web site, correct matching between image files and page numbers is imperative. However, page numbering is rather complicated, as in many instances there are pages within a book that are not counted, and others that are counted but do not show numbers, or that show roman numerals.


As mentioned above, it is the document as a whole, i.e., its overall syntactical structure, that is analysed by the METAe engine, allowing for a highly automated solution for matching images and page numbers. The engine will first find out where the page number is usually located on a page, then the whole row will be extracted, and after that the right sequence will be reconstructed. Pages that have been counted but do not display page numbers can have page numbers added automatically. Missing pages can be detected and marked with a placeholder, and also brought to the attention of the operator.
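The sequence reconstruction described above can be sketched as follows. This is a simplified illustration, not the METAe implementation: it assumes the page numbers have already been read from each image as integers (or `None` where nothing was detected), ignores roman numerals, and uses hypothetical names throughout.

```python
from typing import List, Optional, Tuple


def reconstruct_page_numbers(
    detected: List[Optional[int]],
) -> Tuple[List[Optional[int]], List[int]]:
    """Fill in undetected page numbers and report where pages seem to be missing.

    detected[i] is the number read from image i, or None if none was found.
    Returns (numbers, gaps). Each entry in gaps is the index of the last
    numbered image before a jump that the images in between cannot explain.
    """
    numbers = list(detected)
    gaps = []
    known = [i for i, n in enumerate(numbers) if n is not None]
    for a, b in zip(known, known[1:]):
        step_images = b - a
        step_numbers = numbers[b] - numbers[a]
        if step_numbers == step_images:
            # Consistent run: unnumbered pages in between were counted.
            for i in range(a + 1, b):
                numbers[i] = numbers[a] + (i - a)
        elif step_numbers > step_images:
            # More page numbers than images: pages are missing after image a.
            gaps.append(a)
    return numbers, gaps


# Image 0 shows no number, image 3 is counted but unnumbered,
# and pages 6 and 7 are absent from the scan.
numbers, gaps = reconstruct_page_numbers([None, 1, 2, None, 4, 5, 8, 9])
```

In this example the unnumbered image between pages 2 and 4 receives the number 3, and the jump from 5 to 8 is reported so that an operator can insert placeholders or check the original volume.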

OCR Processing

The OCR engine is a distinct module within the METAe engine. Any available OCR engine, or even more than one, can be used. Nevertheless, it is one of the objectives of the project to develop an OCR engine with improved recognition rates for historical documents. Typefaces used between the 16th and 19th centuries are, in many instances, considerably different from those in use nowadays. Since OCR engines have been trained for modern typefaces, this fact will lower the rate of correctly recognized characters for older documents. Moreover, the vast majority of all printed historical documents in central Europe were set in the German variant of the black letter font, Fraktur.


Image showing Fraktur black letter fonts.
Figure 4. Black Letter Fonts (12)

Currently, no OCR is capable of reading these characters without training, which is a major drawback for all digitisation projects in Europe. One of the leading companies in OCR technology, ABBYY Europe, is responsible for providing this missing link within the METAe project. Since OCR engines rely heavily on background dictionaries, these dictionaries will have to be supplemented with historical forms and words no longer in use. The ABBYY engine will be available as part of the METAe engine and as a separate commercial product (13).

Segmentation and Hierarchical Ordering

The main feature of the METAe engine is its capability for automatically labelling books and journals according to their logical structure. Among the elements that can be detected are page numbers, running titles, chapter headings, titles, footnotes, margin notes, and paragraphs. Moreover, the METAe engine will extract the hierarchical structure, e.g., chapters of a book, or issues and articles within a journal. For the segmentation of the documents, the METAe engine utilizes the fact that most books exhibit internal consistency; i.e., all headlines at a certain hierarchical level are expressed with the same type and style (bold, centered, etc.). If there are sufficiently accurate results from the (physical) layout analysis, it will be possible to find similar elements, group them, and apply labels (14). This feature will be especially helpful for documents, including journals and magazines, that contain a number of single intellectual items that might be recorded individually.
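
The idea of exploiting internal consistency can be illustrated with a small sketch. It assumes that layout analysis has already produced text blocks with a font size and bold/centered flags. The Block class, the label names, and the rule "the most frequent style is body text, larger styles are headings" are hypothetical simplifications, not the METAe algorithm.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Block:
    """A text block as it might come out of a layout analysis (hypothetical)."""
    text: str
    font_size: float
    bold: bool
    centered: bool


def style_of(block: Block) -> Tuple[float, bool, bool]:
    return (block.font_size, block.bold, block.centered)


def label_blocks(blocks: List[Block]) -> List[str]:
    """Group blocks by identical style and derive a label for each block."""
    if not blocks:
        return []
    counts = Counter(style_of(b) for b in blocks)
    # Assumption: the most frequent style is the running body text.
    body_style = counts.most_common(1)[0][0]
    # Styles at least as large as the body text are treated as headings;
    # larger type ranks higher, bold ranks before non-bold at equal size.
    heading_styles = sorted(
        (s for s in counts if s != body_style and s[0] >= body_style[0]),
        key=lambda s: (-s[0], not s[1]),
    )
    level = {s: i + 1 for i, s in enumerate(heading_styles)}
    labels = []
    for b in blocks:
        s = style_of(b)
        if s == body_style:
            labels.append("paragraph")
        elif s in level:
            labels.append(f"heading-{level[s]}")
        else:
            labels.append("other")  # e.g. footnotes set in smaller type
    return labels
```

Because every block with the same style receives the same label, a chapter heading style detected once is applied consistently throughout the book, which is the property the paragraph above relies on.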


Added value and benefits

Cleansed Body Text


One might ask what advantage the detailed labelling of minor elements such as headlines, footnotes or page numbers might have. In order to explain why we believe that this is one of the most innovative features of the METAe engine, we need to understand that books are composed of different functional layers. One layer is what copyright law knows as "the work", i.e., the intellectual item that exists independently from its concrete presentation. Another layer serves the need of the reader to navigate through a book. Elements of this layer include tables of contents, volume indexes, and running titles. Still another functional layer shows an advertisement that has nothing to do with the intellectual work, but might have a value of its own. In general we can say that many elements found in paper-based books are no longer needed in the electronic environment. Many elements that are helpful in books are noise from the point of view of accessing electronic text. This idea can be illustrated with the following example.


Figure 5. Page Image from Scientific Journal (1930)


Figure 5 shows a typical page from a scientific journal from 1930. There is a running title, a page number, graphs, caption lines, a footnote, and a signature mark. The output of a pure OCR engine is shown in figure 6.

Figure 6. Raw OCR Text of the Page Image of Figure 5


This electronic text is not readable and cannot be presented to an end-user, since the raw OCR text has a flat structure and contains the complete text regardless of its hierarchical level or logical value. Assuming that the METAe engine has correctly labelled all elements on this page image, it will be easy to design a Web application where only the intellectual item (such as the article shown above) is presented to the reader and where the other elements, such as the running title or page number, are either not shown or are presented in an appropriate way, e.g., footnotes placed at the end of the text. This "cleansed" electronic text must not be confused with genuinely corrected OCR text, which is usually produced by double-keying procedures.
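
A minimal sketch of such a cleansing step is given here, assuming that each OCR zone arrives as a (label, text) pair. The label names and the choice of which labels count as noise are illustrative assumptions and do not reflect the actual METAe output format.

```python
from typing import List, Tuple

# Labels treated as navigation aids or print artefacts (assumed names).
NOISE_LABELS = {"running_title", "page_number", "signature_mark"}


def cleanse(elements: List[Tuple[str, str]]) -> Tuple[str, List[str]]:
    """Separate the running text from noise and from footnotes.

    Returns the body text in reading order plus the list of footnotes,
    which an application could display after the text. OCR errors are
    left untouched: the text is cleansed, not corrected.
    """
    body: List[str] = []
    footnotes: List[str] = []
    for label, text in elements:
        if label in NOISE_LABELS:
            continue  # not shown to the reader
        if label == "footnote":
            footnotes.append(text)
        else:
            body.append(text)
    return "\n".join(body), footnotes
```

The same labels that drive this filter could also drive a different presentation, for instance showing the running title as a breadcrumb instead of dropping it.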

Figure 7. Cleansed instead of corrected OCR text


In figure 7 we see that the "real" OCR errors of this page from 1930 (scanned at 300 dpi, 8 bit grey-scale) are very rare and are no obstacle to presenting the uncorrected OCR text to the reader. In fact only one real OCR error can be found in the running text (apart from the caption line which might be corrected manually since it will also form the title of a Dublin Core record for this illustration). Obviously this cleansing process will not only be carried out on single pages but will also include noise reduction at the document level.

The cleansed full text will open up new avenues for use. It might lead to a new presentation model for digitized documents on the Internet; e.g., the cleansed full text in the front, and the page image (or parts of it) in the background. It might also lead to new products and some potential commercial benefits for libraries. For example, publishers of e-Book collections will be able to provide their users with millions of cleansed (albeit not corrected) text pages. In the rare case where a user really needs to check whether a word is correct, the page image will still be accessible on the Internet.

Book Collections as Picture Collections

Another simple but effective benefit has to be mentioned here as well. From the 1880s onwards, more and more printed documents contain pictures and halftones. In the case of such illustrated books or journals, for instance the "Garden and Forest" collection at the Library of Congress (14), the text collection will also serve as a picture collection. The page images are kept in grey scale, their captions are labelled automatically, and so is their location within the original document. For the user this will mean that it will be possible to search within all caption lines of a collection and to retrieve just the pictures:


Figure 8. Book Collections as Picture Collections: An Example from Garden and Forest

Digitisation as a Permanent Service

We are convinced that the METAe engine will give libraries the opportunity to create new and effective business models for digitisation. The key for this expectation is that with the METAe engine, digitisation will become much simpler than before. Input and output will be highly standardized, the vast majority of processing steps will be done automatically in the background, and the operator will be needed only for quality control and correction. Such a digitisation process will be easier to establish and libraries might be able to integrate it as a permanent service into their service portfolio. Libraries might provide digitisation on demand, or digitisation of rare books for the needs of a course or a research project.


Conclusion

The project team is convinced that the METAe engine will provide a feasible tool for in-house digitisation of library and archival collections. In order to gain experience from real world applications, the METAe engine will be installed at several METAe partner sites during the fall and winter of 2002. In the first months of 2003 a report will be released about the performance of the engine and best practice models for using it in ways that best fit the needs of libraries and their users.


Footnotes
(1) This is a summary of the work jointly carried out by the participants in the METADATA ENGINE project. I would, nevertheless, like to add special acknowledgments for the following persons: Michael Day, Alexander Egger, Paolo Frasconi, Claus Gravenhorst, Kurt Habitzel, Juha Hakala, Marco Köttstorfer, Simone Marinai, Gregor Retti, Oya Rieger, Birgit Stehno, Jupp Stöpetie, Simon Tanner, and Ralph Tiede.
(2) Partners of the project are: University Innsbruck (co-ordinator), Austria; University of Linz, Department for Applied Informatics, Austria; Mitcom (Abbyy Europe) Neue Medien GmbH, Germany; CCS Compact Computer Systeme, Germany; University Alicante, Spain; Friedrich-Ebert Foundation, Germany; Cornell University Library, Department of Preservation and Conservation, USA; Bibliothèque Nationale de France; The National Library of Norway, Rana division, Norway; Biblioteca Statale A. Baldini, Italy; Dipartimento di Sistemi e Informatica, University of Florence, Italy; University Graz Library, Austria; Scuola Normale Superiore, Centro di Ricerche Informatiche per i Beni Culturali, Italy; Higher Education Digitisation Service HEDS, UK.
(3) Cf. Library of Congress Digital Repository Development. Core Metadata Elements.
(4) Logical and physical levels are always closely linked. A sharp separation might therefore lead to "artificial" and "peculiar" results. The team prefers to regard them as different perspectives of the same subject.
(5) The reasons for taking the METS schema are manifold. To mention just a few: Firstly, METS emerged from the MOA II white paper and has therefore not been developed from scratch but has a strong practical implementation aspect. Secondly, it has an open and flexible structure and, thirdly, it is publicly available at the Library of Congress, and it is, above all, well described.
(6) International Imaging Industry Association. DIG 35 Initiative Group.
(7) A draft document of the ALTO file is already available. After the testing and validation phase the ALTO file will be described in more detail and published on the METAe project homepage.
(8) Cf. Stehno, Birgit and Retti, Gregor: Modelling the logical structure of books and journals using augmented transition network grammars. In: Journal of Documentation (forthcoming, 2002).
(8) Cf. URL: http://www.4digitalbooks.com/.
(9) Benchmark for digital reproductions of monographs and serials. As endorsed by the DLF (January 25, 2002).
(10) The library might decide to store textual zones as 1 bit files in order to keep the file size low.
(11) Black letter fonts for the electronic environment are provided by: Ligaturix - der Frakturkonverter. A collection of different black letter fonts can be found at: URL: http://www.fraktur.com/.
(12) Cf. a METAe project paper on black letter fonts: URL: http://heds.herts.ac.uk/METAe/Articles/art04_2.htm
(13) The natural limit of the automated process has to be mentioned here once more: If there are intellectual structures in a work which do not have a recognisable representation in the layout, the engine will not be able to recognise them automatically.
(14) Garden and Forest: A Journal of Horticulture, Landscape Art, and Forestry (1888-1897). A joint project of the Library of Congress Preservation Reformatting Division, the University of Michigan Making of America project, and the Arnold Arboretum of Harvard University.



Researching Long Term Digital Preservation Approaches in the Dutch Digital Preservation Testbed (Testbed Digitale Bewaring)


Maureen Potter
Digital Preservation Testbed, Netherlands
Maureen.Potter@ictu.nl


In 1996, the Netherlands Ministry of the Interior and the Ministry of Education, Culture and Sciences initiated a collaborative programme entitled Digital Longevity (Digitale Duurzaamheid). This programme, run in conjunction with the National Archives, sponsored Jeff Rothenberg's 1999 publication, Carrying Authentic, Understandable and Usable Records Through Time, which proposed establishing a testbed to carry out research into possible approaches for the long term digital preservation of archival records (1). The Digital Preservation Testbed (Testbed Digitale Bewaring) was born the following year.


This article introduces the work of the Digital Preservation Testbed. It first places the Testbed in context within the rest of the Digital Longevity programme and defines the scope and goals of the project. Our objectives and research questions are identified, followed by a review of the rigorous scientific approach that the Testbed takes in its experiments. The benefits of this are highlighted, as is the practical nature of the Testbed. Finally, the products and deliverables that are expected to emerge throughout the course of the project are discussed and identified.

Background and Scope

The Digital Preservation Testbed is part of a wider network of initiatives that the Dutch government has established to deal with the challenges posed by the electronic era. The Testbed belongs to the Digitale Duurzaamheid Programme, whose overall aim is to guarantee the accessibility of information held by the government in digital form (2). Three other projects complete the Digitale Duurzaamheid programme: the RecordKeeping System (RKS) project, establishing guidelines and providing advice to Dutch Ministries on the selection of an RKS; the Kwaliteitzorg, concerned with ensuring the quality of the records being produced electronically; and the Taskforce Digitale Duurzaamheid, whose main aim is to raise awareness of digital longevity issues throughout government. The goal of the Testbed within the Digitale Duurzaamheid programme is to help achieve the lasting accessibility of government information in digital form. The Testbed will provide advice that is tailored to the situation here in the Netherlands. Our focus is on the preservation of electronic records for the long term, and our strategy begins with preparing for the preservation of records from their point of creation. Our intention is to ensure the reliable creation and management of electronic records so that they are in a suitable state for long-term preservation action. The Testbed is running controlled experiments to explore options for long-term preservation approaches, and advice on these will be issued to the Dutch government later this year.

Our research is initially limited to four main alphanumeric record types: text documents, email messages, spreadsheets and databases, all of which are widely used within ministries and government organisations. Three preservation approaches are under consideration: migration, emulation, and XML, which are discussed in more detail below. Four record types and three approaches result in 12 possible combinations. This initial set concentrates our resources and limits what is otherwise an exponential and unstructured research area. Also, not every record type is suitable for every preservation approach. For example, we do not consider it to be worthwhile to attempt emulation for emails. Email packages rely upon standard exchange formats that enable email systems to be interoperable. The sender and receiver will often perceive the look and feel attributes of a message differently. The question then becomes: "what exactly am I trying to preserve?" You need not preserve something that was not present in the first instance. Emulation is thus not the best match for a preservation approach to this record type.
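
The combination space described above is small enough to enumerate directly. The sketch below lists the 12 pairings and removes the one combination the text explicitly rules out (emulation for emails); the variable and function names are illustrative, and the exclusion set contains only that single example from the text.

```python
from itertools import product
from typing import List, Tuple

RECORD_TYPES = ["text document", "email", "spreadsheet", "database"]
APPROACHES = ["migration", "emulation", "XML"]

# Only the pairing named in the text as not worthwhile; any further
# exclusions would be a matter for the Testbed's own findings.
EXCLUDED = {("email", "emulation")}


def candidate_experiments() -> List[Tuple[str, str]]:
    """Return every (record type, approach) pair that is not excluded."""
    return [
        (record_type, approach)
        for record_type, approach in product(RECORD_TYPES, APPROACHES)
        if (record_type, approach) not in EXCLUDED
    ]
```

Four record types times three approaches gives 12 pairs, of which 11 remain once the email/emulation pairing is dropped.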



An integral assumption in our research is that different record types have different preservation and authenticity requirements. Records ingested into the Testbed are analysed in terms of the five attributes posed by Rothenberg for digital records: Context, Content, Structure, Appearance, and Behaviour. Authenticity Requirements are developed for each record type in terms of these five attributes and act as the success criteria for an experiment's preservation approach. In addition to this, other preservation and archival issues arise from these experiments. Our research thus extends to wider considerations and includes objectives beyond the success of a technical strategy.
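
One way to picture authenticity requirements acting as success criteria is a simple check of required attributes against what survived a preservation action. The five attribute names come from the text. The data layout, the function name, and the example values in the usage note are hypothetical and do not describe the Testbed's actual requirements.

```python
from typing import Dict, List

# The five record attributes named by Rothenberg, as listed in the text.
ATTRIBUTES = ("context", "content", "structure", "appearance", "behaviour")


def failed_requirements(
    required: Dict[str, bool], preserved: Dict[str, bool]
) -> List[str]:
    """Return the required attributes that were not preserved.

    `required` marks which attributes must survive for a given record
    type; `preserved` records what an experiment actually observed.
    Attributes missing from either mapping are treated as False.
    An empty result means the experiment met its success criteria.
    """
    unknown = (set(required) | set(preserved)) - set(ATTRIBUTES)
    if unknown:
        raise ValueError(f"unknown attributes: {sorted(unknown)}")
    return [
        name
        for name in ATTRIBUTES
        if required.get(name, False) and not preserved.get(name, False)
    ]
```

With made-up values such as `required = {"context": True, "content": True}` and `preserved = {"content": True}`, the function would return `["context"]`, flagging the experiment as unsuccessful for that record type.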

Objectives

The objectives of the project are to provide insights into:

  • Technical solutions for the preservation of authentic electronic records
  • The effectiveness of current and potential preservation approaches
  • Authenticity features of digital records
  • Cost factors for storing, preserving and managing digital records and associated metadata
  • Management processes and activities required to capture, generate and maintain metadata that support the ingestion of records and preservation of long-term access to authentic electronic records

Research Questions

The Testbed Research Framework translates these objectives into a clear set of research questions that are refined and updated throughout the duration of the project. These range from fundamental research questions that require comparing the results of large groups of experiments, to questions focusing on the role and significance of record features, attributes, and metadata that may be answered by individual and smaller groups of experiments.

Fundamental Research Questions include:
  • What are the advantages and disadvantages of implementing each of the specified preservation approaches?
  • What are the factors that affect the effectiveness or appropriateness of a particular preservation approach?
  • What are the basic requirements for preservation functions?

These questions can be considered in light of cost, record type, authenticity requirements, and supporting resources, to name but a few.

The subset of Attribute research questions includes:
  • What are the options for preserving record attributes?
  • What factors affect the preservation of record attributes?

These questions consider attributes in terms of record type, software, preservation approach, metadata implementation, and preservation function implementation. Defining essential preservation metadata is also a priority of the project, and is included in other research questions.


Preservation Approaches

As the Testbed Research Framework notes, record keepers have not yet defined an explicit and definitive methodology for any of the preservation approaches we are considering. The Testbed is contributing to the delineation of various appropriate methodologies. As discussed elsewhere, there are various ways to implement a migration strategy (3). The same can be said for emulation and the vaguely defined "XML approach." Indeed, there are so many different ways to formulate a strategy that the boundaries between them can become indistinct (4). Let's begin by defining the various approaches: migration, emulation, and the XML approach (5).


The Testbed definition of migration is a relatively simple one: the transfer of records from one hardware and software configuration to another. This includes migration through and over generations of application versions, as well as across applications and operating systems. In practice this often involves proprietary or partly proprietary software. Media refreshing is excluded.

Closely related to this is the XML approach. Conversion to XML (or conversion to standards in general) can be considered to be a form of migration. However, XML is not tied to any particular software system and is often regarded as the most promising present day format for archiving and interoperability. It has a multiplicity of uses, and so deserves to be considered as an approach in its own right.

For the emulation strand of our experiments we will be working with IBM and will perform experiments with their UVC (Universal Virtual Computer) on archival records at the Testbed. The UVC approach is described in an earlier RLG DigiNews article. The UVC is a variation of the emulation approach that addresses the problem of future interpretation of data files by writing a program to carry out the interpretation in the language of a Universal Virtual Computer. This strategy can be extended to archiving the program as well, making it more like the full emulation approach identified and proposed by Rothenberg (6).

Methodology

The Testbed is a practical project that runs controlled experiments in a secure environment to study potential preservation approaches, and to ascertain the effects of preservation actions on different file formats and record types. In order to provide valid and reliable results, the Testbed has established a rigorous experiment process that clearly articulates the requirements for each experiment.


An experiment process is defined around a specific preservation approach and record type, e.g., migration of text documents, or conversion of emails to XML. Each process consists of 12 documented stages. Once a process has been defined and run for the first time, further experiments can be run on that process using its generic requirements and procedures.



Figure 1. Flow chart of Testbed Project Experiment Process

For example: Experiment process 1 caters to the migration of text documents. The first five stages in the process are general stages concerned with the broad requirements of the preservation approach (migration) and record type (text documents). These stages define the exploration area, identify relevant background literature, specify authenticity requirements and evaluation criteria, develop an overall experiment process design, and estimate the required resource specifications. The remaining stages are then specific to an individual experiment; that is to say, they specify the records to be used in that experiment (e.g., Microsoft Word 95) and the specifics of the approach (e.g., migration to PDF). Several further experiments can therefore be run using the generic documents from the first five stages of one experiment process. These may examine different formats of the record type (e.g., WordPerfect or Microsoft Word 2000 as text documents) or they may examine metadata issues.


There are several advantages to this approach. The controlled, well-documented experiment process allows each experiment to be easily reconstructed in the future and allows experiments to be re-run to confirm or check results, if necessary. The process lays out requirements to be considered at each stage, from preservation requirements to functional requirements, metadata requirements to authenticity requirements. The iterative nature of the process allows us to produce a base set of documents for each set of related experiments, thus eliminating any duplication. Grouping experiments in this way also helps us to consider the combined results of sets of experiments and the overall "success" of each preservation approach. The secure and controlled Testbed environment, in which all of the experiments are carried out, ensures that our experiment results are valid and free from errors from extraneous sources. We carry out regular control and null experiments to ensure this remains consistent.

The practical nature of the project brings other benefits as well. The experiment process covers all of the key points in the history of the record, from creation (by way of model records (7)) to capture, appraisal, and long-term storage. These aspects can have unexpected effects on the implementation and success of a preservation strategy. The Testbed considers all of these aspects within the scope of the experiment process, and they have yielded useful and interesting results.


Preliminary Results (8)

Our first experiments concentrated on the migration of text documents. This combination was chosen for the first experiments as the use of text documents is widespread throughout government and many organisations already carry out routine migrations when updating their computer systems. Our main goal in the early experiments was to examine and identify record features that changed as a result of the migration process.


Microsoft Word was identified as a good starting point. It is one of the most widely used word processing applications available and is used by many agencies to produce government records. Experiments have taken place on the migration of text documents through and over generations of Microsoft Word using both model and test records. Migration through generations refers to migrating through successive versions of an application, e.g., from Word 95 to Word 97 then to 2000 and then to 2002. Migration over generations skips the intermediate versions and goes straight to the current highest version, e.g., from Word 95 to Word 2002. We have experimented with migration through and over generations of Word 95, Word 97, Word 2000 and Word 2002. We have also experimented with migration of Word files to PDF 1.2, 1.3, and 1.4.
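
The difference between the two migration paths can be made concrete with a short sketch. The version list is taken from the text; the function names and the representation of a path as a list of (from, to) steps are illustrative assumptions.

```python
from typing import List, Sequence, Tuple

# Word versions used in the experiments, oldest first (from the text).
VERSIONS = ["Word 95", "Word 97", "Word 2000", "Word 2002"]

Step = Tuple[str, str]


def _check(source: str, target: str, versions: Sequence[str]) -> Tuple[int, int]:
    i, j = versions.index(source), versions.index(target)
    if i >= j:
        raise ValueError("target must be a later version than source")
    return i, j


def through_generations(
    source: str, target: str, versions: Sequence[str] = VERSIONS
) -> List[Step]:
    """One migration step per successive version between source and target."""
    i, j = _check(source, target, versions)
    return list(zip(versions[i:j], versions[i + 1 : j + 1]))


def over_generations(
    source: str, target: str, versions: Sequence[str] = VERSIONS
) -> List[Step]:
    """A single step that skips all intermediate versions."""
    _check(source, target, versions)
    return [(source, target)]
```

Migrating a Word 95 record to Word 2002 through generations takes three conversion steps, whereas migrating over generations takes one, which is why the relative reliability of the two paths bears on the cost of migration.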

Results of the model record Word experiments showed that if the record had been created well initially, it stood a far better chance of retaining its features through and after migrations. Fields that update automatically (e.g., date fields) and that were not fixed after document creation wound up being updated every time the document was accessed, thus altering an essential content and context-reference feature. This is a problem whilst the records are still in active use, let alone when they reach the archives. However, most features migrated successfully. The position of the text on the page was sometimes different, but colour, paragraph and font formatting, bullets and numbering, inserted and well-formed tables, hyperlinks, pictures and diagrams were all successfully retained in the experiments we have carried out so far (9).

Use of records donated by government ministries took our investigations to the next level. We had not been involved in the record creation, so we could not be sure of how the features had been formed. One record, which at face value looked like a well-formed table, turned out to be composed of floating text boxes. Other records from different participants had designated "protected sections" in which automated fields had been used and then "fixed" in place. These sections included such essential metadata items as date, author, recipients, and a unique reference. Yet other records were composed on different computers with different settings, or included text that had been cut and pasted from a different application altogether. These "cut and paste" sections are affected differently than the rest of the document during and after a migration and can result in a change in the appearance of the record without adversely affecting its content.


These experiences allow us to examine unexpected record features and to assess more closely the ways in which records are created. As a result, we can formulate advice on the creation of records and use of record-creating applications, putting preservation concerns in place from the beginning of the records continuum.

The set of Microsoft Word experiments showed that, generally, migration over generations was at least as reliable as, and in some cases more reliable than, migration through generations. This may counter some of the scepticism about the costs of migration. Sceptics have argued that the recurring costs of migration will be too great to bear. Our results so far have shown that migration to each new version of an application should not be necessary, and we hope that experiments with other word processing applications will allow us to extend this hypothesis. The archival regulations of the Netherlands state that ministries are responsible for the authentic retention of their own records for the first 20 years, after which time a percentage of them are sent to the National Archives for long term archiving. The rest can be disposed of according to disposition regulations. It may be the case that Ministries can simply retain the documents in their original format, with maybe one or two controlled migrations, until the 20 years have passed. The Archives can then undertake more suitable long-term action concerning the current and future formats of the record.

This is simply one possibility that we are considering. We still have many more experiments planned, and are waiting until all of the results are in before we release full advice on the long-term preservation of records produced by government agencies. The combined results will allow us to weigh alternative approaches, in conjunction with metadata and authenticity requirements, and determine the best way to implement these approaches. There are many different ways to carry out the same task and it is unlikely that a "one size fits all" approach will be suitable for different record types with different retention requirements.

Products, Tools, and Deliverables

In addition to the research results, the Digital Preservation Testbed will also develop a more concrete set of tools and products. These include the Testbed Research Database, which supports our experiments and acts as a valuable source of knowledge on many aspects of digital preservation. This database is being built over the course of the project and aims to collect and provide commentary on relevant digital preservation literature. The contents are not limited to publications, but also include public listserv messages, Web sites, presentations, and Testbed references. Wherever possible, we have gathered electronic copies of the research documents and stored them in the Testbed system for reference by the project team. These are supplemented by online resources (for which the URLs are checked on a regular basis) and printouts. The research database is easily searchable and will be a valuable record of the project, as well as a useful resource. An abridged version for online resources only is available on our Web site.


Other deliverables include white papers on each of our preservation strategies. The white paper on migration was published in December 2001, and the XML for preservation paper is scheduled for publication in late summer 2002. These white papers aim to provide a synopsis of current knowledge about each preservation approach, and to delineate ways in which the approach can be implemented. We will also deliver a technical report on the Testbed system itself, for which an extensive set of functional requirements is being developed. The Testbed Newsletter is produced quarterly and the Web site is updated monthly.

Thus far, the Testbed has completed groups of experiments on emails and text documents and is issuing preliminary advice on the short and long-term preservation of emails. Work will soon commence on spreadsheets and databases, for which advice will also be released. The Testbed project is due to finish in October 2003 but preservation research is unlikely to stop there. The Digital Longevity Programme will continue to run and coordinate digital recordkeeping and archival efforts for the Government of the Netherlands.

Footnotes
(1) Jeff Rothenberg & Tora Bikson: Digital Preservation. Carrying Authentic, Understandable and Usable Records Through Time. (The Hague, 1999)
(2) The Testbed and the Taskforce are also part of the ICTU, a non-profit organisation established by the Dutch government to house their e-government projects, including PKI (Public Key Infrastructure) and advies overheid (monitoring and advising on government Web sites at every level). See http://www.ictu.nl for further details. The close proximity of these projects allows them to easily collaborate and share information.
(3) Testbed Digitale Bewaring white paper, Migration: Context and Current Status (The Hague, 2001).
(4) See for example Kees van der Meer et al., Emulation and Conversion: Organisational and Architectural Overview of an Electronic Archive (Technical University of Delft and Utrecht University, 2001). This can also be seen by comparing current literature from around the globe dealing with preservation strategies and enforcement.
(5) See the Testbed White Paper on Migration (op cit) for a more extensive discussion on each of these approaches.
(6) Rothenberg & Bikson, op cit. See also Jeff Rothenberg: An Experiment in Using Emulation to Preserve Digital Publications, NEDLIB Report Series (The Hague: NEDLIB consortium, 2000) and Avoiding Technological Quicksand: Finding A Viable Technical Foundation for Digital Preservation: A Report to the Council on Library and Information Resources. (Washington, D.C.: Council on Library and Information Resources, 1999).
(7) The Testbed uses two sorts of records in its experiments. The first are model records, which are created in the Testbed to examine and evaluate the effects of preservation action on specific record features (e.g., user-defined and automatic fields, templates, font and paragraph formatting, and signatures). The advantage to using our own records for this purpose is that we know exactly which features are present in each record and where. This allows us to carry out highly focussed experiments on record attributes. It also serves as a good starting point for any new round of experiments. The second sort of records includes test records. These are obtained from ministries and other government organisations, and are used in the larger-scale experiments that address the fundamental research questions and authenticity requirements.
(8) This section is intended to give the reader a flavour of the types of results we have gathered so far. Future reports from the Testbed will discuss our results to a greater extent than this introductory article.
(9) The exact position of the text on the page can be affected by as small a thing as changing the printer or printer driver. This is not a change that has affected the authenticity of the records in any of our experiments to date, but it has resulted in documents containing several more pages than they did originally, especially if page breaks have been employed.




Highlighted Web Site

Chilling Effects: Monitoring the Legal Climate for Internet Activity

This Web site was launched in February 2002 by the Electronic Frontier Foundation (EFF) and four law school legal clinics, representing Harvard University, Stanford University, UC Berkeley, and the University of San Francisco. The list of participating clinics is expected to grow. The site provides detailed information about Internet users’ legal rights in their online activities. Legal topics covered include DMCA (Digital Millennium Copyright Act) and other copyright issues, domain names and trademark law, anonymous speech issues, and defamation. For each topic, the site provides a basic legal overview, basic guidelines, FAQs, and related links.

The most distinctive feature of this site is the “Cease & Desist” database. Chilling Effects Clearinghouse has encouraged Internet users to post cease-and-desist letters sent to restrict their online activities. Law students review the collected letters and annotate them with related resources. The letters are searchable by subject or keywords. Each search result includes a cease-and-desist letter, related FAQs and related links.

This Web site allows librarians and archivists to track online legal trends, justify their own online activities, and, most of all, defend their legal rights on the Internet. In addition, this site provides a connection for groups and organizations that are threatened by the same kind of cease-and-desist letters. With its focus on intellectual property laws, this site is an invaluable information source.

FAQ

Where are they now? Digitizing Microfilmed Newspapers

We often read about new projects and programs in RLG DigiNews, but what about past efforts? What results have been produced in the five years since RLG DigiNews began publishing?

Introduction

In this issue, we continue a feature started in our five-year anniversary issue (April 2002) and take another look at projects originally reported on during our first year of publication. In August 1997, an article in RLG DigiNews by Alan Howell discussed three International Newspaper Film Scanning Projects. The projects profiled were the scanning of the Burney Collection at the British Library, the Caribbean Newspaper Imaging Project of Cuban and Haitian newspapers at the University of Florida, and the scanning of mid-19th century periodicals and newspapers for the Australian Cooperative Digitization Project (ACDP).

As source material for digitization, newspapers are amongst the most challenging. Large page size, tiny type, halftone images, haphazard layout and poor bibliographic control are commonplace. Newspapers scanned from microfilm introduce additional difficulty in terms of image quality, processing, and sometimes cost. These factors complicate capture, delivery and access control and lead to a variety of technical and management obstacles.

Given the technological advances that have occurred in the past five years, we thought it would be instructive to revisit these projects to learn where they are today, what lessons have been learned, and how the state-of-the-art has changed. We were able to obtain updates on two of the three original projects, the Burney Collection scanning initiative and the Australian Cooperative Digitization Project. The Caribbean Newspaper Imaging Project (CNIP) is in the midst of a "technological renovation" (migrating from CD-ROM to Web) and was unable to meet our publication deadline. However, project reports from two earlier phases are available on CNIP's Web site (1).

Scanning the Burney Newspaper Collection at the British Library
Contact person: John Goldfinch, Early Printed Collections, The British Library

The Burney collection consists of 700 volumes of 17th-, 18th-, and 19th-century newspapers. It is especially prized for its coverage of 18th-century London newspapers, including many unique items. Owing to its age and condition, the originals are not available for study, and the popular collection is currently viewable only on microfilm.

The original digitization work on the Burney collection was actually a British Library (BL) experiment that started in 1992 with an effort to determine the utility of its Mekel microfilm scanner in digitizing the library's microfilm holdings. From 1993 to 1996, BL attempted to learn what digital technology could do for it, as part of its Initiatives for Access program. The Burney collection was chosen for testing in part because of the challenges it presented. The original documents varied substantially in density, both across and within pages, and much of the type is broken, leading to film described by Hazel Podmore in BL's write-up of its experiment as "not the best-quality the Library has ever produced" (2).


As expected, digitizing the Burney film proved difficult. According to John Goldfinch, the principal difficulties stemmed from the "expense and availability of high capacity storage for the files being created, together with the level of manual intervention required to deal with such things as deskewing the images, and coping with the highly variable print quality of the originals and deficiencies in the film."


In fact, Podmore noted that storage considerations led to most scanning being carried out at a less-than-optimal 200 dpi. Another technical obstacle, considered insurmountable at the time, was the inability of any available Optical Character Recognition (OCR) package to produce acceptable machine-readable text for searching and indexing. At the time, the difficulties were so significant the British Library concluded that it could not justify continuing digitization of the Burney collection.

BL feels that much has changed in the intervening years. Storage costs have dropped by about two orders of magnitude (about 100 times). John Goldfinch asserts that "recently demonstrated developments in OCR technology offer the exciting prospect that OCR is at last able to cope with the difficulties of early type." BL and other institutions have gained considerable experience in the scanning of microfilm of early printed material.

As a consequence, BL is revisiting the question of the Burney newspapers, and has received a grant from the National Science Foundation to begin creating a fully searchable on-line library of British 18th-century newspapers. As it had originally hoped to do in the mid-1990s, BL now plans to produce a complete set of images along with an index of titles with issue dates and numbers, and to make the complete collection freely available to researchers over the Web. A precise release date is not yet available.

Australian Cooperative Digitization Project, aka the Ferguson Project
Contact person: Ross Coleman, Collections Coordinator, University of Sydney Library

The Australian Cooperative Digitization Project is a collaboration involving the University of Sydney Library, the State Library of New South Wales, the National Library of Australia and Monash University Library, amongst others. Ross Coleman and Colin Webb summed up the project's purpose as the "enhance[ment of] literary and historical research on nineteenth-century Australia by providing improved access to, and preservation of, scarce primary material confined to a few major library collections" (3). The material selected for digitization, based on Ferguson's Bibliography of Australia, is confined to a critical six-year period in Australian history (1840-45). Though preceding the Australian gold rush, these materials represent a historical gold mine and provide a defining record of a distinct Australian colonial culture.

ACDP was carried out as a true hybrid project, ensuring that preservation-quality microfilm existed or would be created for every title, and that stringent quality control guidelines governed the creation of digital images from the film. Although conceived as an experiment with a strong mandate to develop policies and procedures that could be applied to other Australian digitization efforts, ACDP also carried significant production expectations. Sixty-seven periodical titles (including newspapers) and four novels were ultimately digitized.

In providing a retrospective to RLG DigiNews, as well as in previous summaries of the project (see http://www.nla.gov.au/ferg/about/ for several references), Ross Coleman and his colleagues have been unusually candid about the obstacles they encountered in bringing ACDP to fruition. Particularly noteworthy, and of value to anyone contemplating a digitization program of any size, is the excellent discussion in Webb and Coleman of the tug-of-war between workflow-enhancing automation procedures and the need to maintain sufficient quality for text legibility and successful processing via OCR (4).

For example, most of the existing film was not of sufficient quality for digitization, and even the refilmed material presented some barriers to fully automated digital capture. As is so often the case in library and archive projects, quality requirements ultimately ruled, but a heavy price was paid in terms of missed deadlines, tense vendor relations, loss of staffing continuity, and frayed nerves.


As might be expected, newspapers proved particularly troubling, especially because of their size. However, Ross Coleman identified additional obstacles stemming from "the variety within any one title, from foxing and discoloration, to the use of varying fonts and point sizes on the one page." Coleman also acknowledges that newspapers require quality OCR in order to truly justify their digitization, since they generally lack even rudimentary indexing. Unfortunately, marginal print quality and type size variation often thwart the creation of an accurate body of searchable text, even with current technology. (For example, ProQuest Historical Newspapers reports 80-90% OCR accuracy for the article text from its New York Times microfilm.)
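
To make accuracy figures like these concrete, the sketch below shows one common way of scoring OCR output: character accuracy, computed as one minus the edit distance between a hand-corrected reference transcription and the OCR text, divided by the length of the reference. This is an illustrative assumption only. The source does not say how ProQuest or either project measured accuracy, and the function names and sample strings here are hypothetical.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions turning one string into the other."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # character missing from the OCR text
                current[j - 1] + 1,      # extra character in the OCR text
                previous[j - 1] + cost,  # match or misread character
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Share of reference characters the OCR text got right (0.0 to 1.0)."""
    if not reference:
        raise ValueError("reference text must not be empty")
    errors = edit_distance(reference, hypothesis)
    return max(0.0, 1.0 - errors / len(reference))


# Hypothetical sample: a hand-keyed headline and a plausible OCR misreading of it.
reference_text = "The Sydney Herald"
ocr_text = "Tbe Sydncy Herald"
print(f"{character_accuracy(reference_text, ocr_text):.1%}")
```

In this sample, two misread letters in a three-word headline already pull the score below 90%, and each misread word can no longer be found by a keyword search, which is why marginal print quality matters so much for newspapers that lack any other index.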

Despite having successfully completed its initial objectives, ACDP exceeded its anticipated resource consumption to such a degree that conversion of additional 19th-century periodical titles has been put on hold. Within the selection of periodicals, newspapers continue to be viewed as especially daunting targets for digital capture. Coleman reports "the fact that no more have been done, or even contemplated, highlights the fact that, at the time, we were not confident in the technology, or our procedures, or in the effectiveness in delivering such things over the Web in a usable manner."

So while ACDP has succeeded in greatly expanding access to a corpus documenting an important slice of Australian history, it has not, as yet, provided the basis for expanded conversion of other materials from that period.

Conclusion

In revisiting these two projects, we encountered somewhat different perspectives about the current viability of digitizing, OCRing, and providing Web access to microfilmed newspapers. One possible explanation for the differing opinions is the timing of the initiatives. ACDP started out as the Burney experiment (along with many other early digitization experiments) was wrapping up. Burney was conceived as more of an experiment, and was carried out at a time when the technology was clearly not up to the task. It was shelved until very recently, when the technology seemed like it might finally be able to tackle the challenge.

On the other hand, ACDP was conceived as a production enterprise, and was carried to completion despite knocking against technological barriers at several points. Having only recently completed the mounting of files, ACDP is still hesitant about taking on additional conversion, given the technological obstacles it encountered.

It is noteworthy, however, that what most distinguishes Ross Coleman's perspective from that of John Goldfinch has little to do with the technological underpinnings of the respective projects. Although both speak to the frustrations of digitizing challenging older materials, the most striking difference is in Coleman's emphasis on the obstacles created by management issues. Problems faced in the management arena remain underreported and under-discussed within digital imaging circles, compared to those in the technical realm. Even as some (though by no means all) of the technological barriers to effective large-scale digitization of older printed materials begin to fall, we would be wise not to downplay the ongoing challenges represented by funding, staffing, vendor relations, planning, and the like.

Perhaps the ultimate lesson from the experiences described above is that there is still no such thing as a large-scale, cookie-cutter digitization project. Despite many successfully completed efforts and improved availability of training and documentation, the work remains technically complex, time-consuming, and expensive. Working from marginal source materials introduces additional complexities, and newspapers continue to push the limits of current digital capture, image processing, OCR, and Web delivery technologies.

Further reading

In addition to the references already given, here are some useful readings on recent newspaper digitization efforts:

The ProQuest Historical Newspapers project (backfiles of the Christian Science Monitor, the Wall Street Journal, the New York Times, the Washington Post and Canadian newspapers digitized by Cold North Wind ("practically every newspaper published in Canada from 1750 to 1950") with plans to add other national, regional and local publications). The home page provides links to a slide show about the project. An additional demo is also available.

OCLC Digital & Preservation Resources and Olivesoft digitization of historic newspaper collections (an initiative "to help libraries provide full online searchable access to their historic newspapers"). Read the press release for this collaboration and read about Olivesoft's ActivePaper Archive software.

The Nordic Digital Newspaper Library (Nordic Newspapers from 1640-1860). Read a paper by Majlis Bremer-Laamanen presented at the 2001 Annual Meeting of the United States Newspaper Program held at the Library of Congress in Washington, DC on April 26th 2001.

Digitisation of Newspaper Clippings: The LAURIN Project by Günter Mühlberger. RLG DigiNews, v. 3, no. 6, December 15, 1999.

--RE


Footnotes
(1) Erich Kesse, Robert Harrell, Richard Phillips and Cecilia Botero, Caribbean Newspaper Imaging Project, Phase I: Imaging and Indexing Model and Phase II: OCR Gateway to Indexing. (back)
(2) Hazel Podmore, The Digitisation of Microfilm in L. Carpenter, S. Shaw and A. Prescott, eds., Towards the Digital Library (London, 1998). (back)
(3) Colin Webb and Ross Coleman, Digital conversion of Nineteenth century publications—Production management in the Australian Cooperative Digitisation Project 1840-45. LASIE, v. 31 no. 2, June 2000, pp.5-20. Also available in HTML. (back)
(4) Ibid. (back)


Calendar of Events

DELOS International Summer School on Digital Library Technologies
July 8-12, 2002
Pisa, Italy

The DELOS Network of Excellence Second International Summer School on Digital Library Technologies will focus on digital library applications. Attendees will come from the computer science community, the industrial communities (electronic publishing, broadcasting, software industry), and the user communities interested in digital library technologies (libraries, archives, museums).

Workshop at the Joint Conference on Digital Libraries
July 18, 2002
Portland, OR
This workshop is aimed at developers, researchers, educators, and administrators interested in educational programs for training the next generation of digital library professionals.

Seventh International Summer School on the Digital Library
July 28-August 1, 2002; August 4-9, 2002
Tilburg University, The Netherlands
September 29-October 4, 2002
Florence, Italy
November 3-7, 2002
Leeds, England
This year, the Summer School will consist of four one-week courses: Managing the Change Process towards your Library of the Future (offered twice); Digital Libraries and the Changing World of Education; and Electronic Publishing: Libraries as Buyers, Facilitators, or Producers.

DRH 2002: Digital Resources for the Humanities Conference
September 8-11, 2002
Edinburgh, Scotland
This conference brings together the creators, users, distributors, and custodians of digital resources in the Humanities. Themes this year include: Provision and management of access; Digital libraries, archives and museums; Time-based media and multimedia studies in music and performing arts; Network technologies used to support international community programs.



Announcements

The Data Dictionary for Technical Metadata for Digital Still Images
Of great value to the digitization community is this draft standard (NISO Z39.87/AIIM 20). The focus is on defining descriptive metadata and technical metadata. Z39.87 is an important building block to support the development of applications to validate, manage, migrate, and otherwise process images so they maintain their value over time.

SEPIA (Safeguarding European Photographic Images for Access) Research Available
This working group report focused on the effects of scanning equipment on original photographic materials. Existing research and data from testing are included to formulate concrete recommendations in selecting scanning equipment.

IMLS REPORT: Status of Technology and Digitization in the Nation's Museums and Libraries
The Institute of Museum and Library Services (IMLS) issued this report, which is based on a digitization survey sent to state library agencies, academic libraries, public libraries, and museums.

Digital Preservation Coalition Launches Web Site
This site includes an online edition of Preservation Management of Digital Materials and a Web version of the first issue of What's New in Digital Preservation, a collaboration between the DPC and PADI.

Digital Imagery for Works of Art Final Report
In November 2001, the National Science Foundation, The Andrew W. Mellon Foundation, and the Harvard University Art Museums jointly sponsored an invitational workshop on imaging for works of art. This workshop was designed to bring together computer and imaging scientists who have been active in digital imagery research with research scholars in the visual arts, including art and architecture historians, art curators, and conservators.

Using Metadata to Manage Digital Video Archives
As digital video collections keep growing, it will become ever more important to be able to use automatic indexing techniques to facilitate effective information retrieval. Video that is "born digital" will have increasing amounts of descriptive information automatically created during the production process. The report provides guidance on metadata for video archives.



RLG News
RLG and OCLC Release Two Final Reports

Trusted Digital Repositories: Attributes and Responsibilities
In May, the RLG/OCLC Working Group on Digital Archive Attributes released their final report, Trusted Digital Repositories: Attributes and Responsibilities. One of two initiatives working in the context of the Reference Model for an Open Archival Information System, this report is primarily intended for cultural institutions such as libraries, archives, museums, and scholarly publishers and is specifically aimed at those with traditional or legal responsibilities for the preservation of cultural heritage. It is written to aid senior administrators as well as those implementing digital archiving services.


Following a short historical introduction, the report presents a brief definition of "trusted digital repositories," provides some examples of the circumstances in which institutions are undertaking their creation, and speaks to the nature and achievement of trust. It addresses the seven attributes such repositories must have and discusses requisite responsibilities at both the higher organizational/curatorial level and the operational level. Finally, the report looks at how repositories can be certified and summarizes seven key recommendations.

An appendix to the RLG-OCLC report provides technical overviews of the "Reference Model for an Open Archival Information System" (OAIS), a common framework for describing and comparing architectures and operations of digital archives. (Compliance with this model is a defining attribute of a trusted digital repository. In January 2002 RLG established an OAIS resources page and discussion list at its Web site to assist implementers.) An operational responsibilities checklist, a glossary, and selected additional resources round out the report.

Trusted Digital Repositories: Attributes and Responsibilities benefits from international discussion and excellent feedback on a draft that RLG and OCLC released for community comment in August 2001. Stakeholders in the effort to preserve digital materials were urged to contact RLG program officer Robin Dale with their comments on the earlier Attributes of a Trusted Digital Repository: Meeting the Needs of Research Resources. Dale continues to welcome comments and ideas on the final publication.

A Metadata Framework to Support the Preservation of Digital Objects
In early June, the OCLC/RLG Preservation Metadata Working Group released its final report, A Metadata Framework to Support the Preservation of Digital Objects. The report is a comprehensive guide to preservation metadata that is applicable to a broad range of digital preservation activities. Preservation metadata is the information infrastructure necessary to support processes associated with the long-term retention of digital resources, and is an essential component of most digital preservation systems.


The report represents the consensus of leading experts and practitioners comprising the working group, and is intended for use by organizations and institutions managing, or planning to manage, the long-term retention of digital resources. The working group based its work on preservation metadata element sets developed by several leading institutions and organizations in the digital preservation community, as well as the OAIS reference model.

The report follows on the working group's earlier white paper Preservation Metadata for Digital Objects: A Review of the State of the Art, which defined and discussed the concept of preservation metadata, reviewed current thinking and practice in the use of preservation metadata, and identified starting points for consensus-building activity in this area.


Publishing Information

RLG DigiNews (ISSN 1093-5371) is a newsletter conceived by the members of the Research Libraries Group's PRESERV community. Funded in part by the Council on Library and Information Resources (CLIR) 1998-2000, it is available internationally via the RLG PRESERV Web site. It will be published six times in 2002. Materials contained in RLG DigiNews are subject to copyright and other proprietary rights. Permission is hereby given for the material in RLG DigiNews to be used for research purposes or private study. RLG asks that you observe the following conditions: Please cite the individual author and RLG DigiNews (please cite URL of the article) when using the material; please contact Jennifer Hartzell, RLG Corporate Communications, when citing RLG DigiNews.


Any use other than for research or private study of these materials requires prior written authorization from RLG, Inc. and/or the author of the article.


RLG DigiNews is produced for the Research Libraries Group, Inc. (RLG) by the staff of the Department of Preservation and Conservation, Cornell University Library. Co-Editors, Anne R. Kenney and Nancy Y. McGovern; Production Editor, Barbara Berger Eden; Associate Editor, Robin Dale (RLG); Technical Researchers, Richard Entlich and Peter Botticelli; Technical Coordinator, Carla DeMello; Technical Assistant, Kimberly Gazzo.


All links in this issue were confirmed accurate as of June 10, 2002.

Please send your comments and questions to preservation@cornell.edu.
