June 15, 2002, Volume 6, Number 3
ISSN 1093-5371
Table of Contents
Editors' Interview
Feature Article 1
Feature Article 2
Highlighted Web Site: Chilling Effects: Monitoring the Legal Climate for Internet Activity
FAQ: Where Are They Now? Digitizing Microfilmed Newspapers, by Richard Entlich
Calendar of Events
Announcements
RLG News
Editors' Interview
The Internet Archive
Brewster Kahle
The Internet Archive
brewster@archive.org
Editors' Note
The editors interviewed Brewster Kahle by phone on May 15, 2002. Here is an edited version of the interview.
Brewster Kahle is the founder and director of the Internet Archive; co-founder of Alexa Internet, an Internet-focused company that concentrates on Web navigation tools and techniques; and inventor and founder of Wide Area Information Servers, Inc. The Internet Archive launched the Wayback Machine, a Web site that provides an interface to the Internet Archive collections, in October 2001.
General Operations
You launched the Wayback Machine a little more than six months ago. How has the actual response compared to your expectations?
What kind of organizational infrastructure does it take to run the
Internet Archive and the Wayback Machine?
Content and Capture
Do the rate of growth and sheer size of the Internet Archive present
special technological problems?
You have a number of special collections on your site, such as Election
1996. Do you have an ad hoc approach as subjects of interest surface or
do you have a collection plan? The Internet Archive FAQ encourages donations
of collections. Will your special collections program expand?
Your policy is to collect Web pages that are publicly available. Do
you have any estimates of how much of the Web is inaccessible to your
crawlers?
The Internet Archive FAQ indicates that the addition of a robot exclusion
to a Web site will "lead to the removal of the pages from the Wayback Machine." What amount of content has been removed as a result
of this policy? How is the removal documented?
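As a rough illustration of the robot exclusion mechanism this question refers to, the sketch below uses Python's standard `urllib.robotparser` to test whether a robots.txt file excludes a given crawler. The user-agent name `ia_archiver`, the sample robots.txt, and the example URL are assumptions made for illustration; the sketch does not describe how the Internet Archive itself processes exclusions or removals.

```python
from urllib.robotparser import RobotFileParser


def is_excluded(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt text forbids user_agent from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


# Hypothetical robots.txt that excludes one crawler from the whole site.
SAMPLE_ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""

# Assumed crawler name and example URL, for illustration only.
excluded = is_excluded(SAMPLE_ROBOTS_TXT, "ia_archiver", "http://example.org/page.html")
other = is_excluded(SAMPLE_ROBOTS_TXT, "otherbot", "http://example.org/page.html")
print(excluded, other)
```

A crawler that honours the exclusion would skip the site when `is_excluded` returns True; crawlers not named in the file remain unaffected.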
Web Crawlers and Access
As of October 2001, the Internet Archive offered access to roughly 10 billion pages. How long does it take to crawl the entire Web? Do you give any preference to crawling particular portions of the Internet more often or more thoroughly? If so, how?
There are more pages now, but we do not have a more recent number. The Internet Archive receives most of its Web pages from donations from Alexa Internet. They do a new crawl every two months. We crawl for special collections weekly or daily. If we discover a new page, we crawl it within 24 hours. We do a complete sweep every two months.
A powerful Web crawler must be essential to the development of the Internet Archive. Could you describe how your Web crawler support has changed in response to the evolution of this technology since you began the Internet Archive in 1996?
In searching for specific pages/sites on the Wayback Machine, we've noticed varying gaps in content, sometimes within a page, e.g., missing images, sometimes within a site, e.g., missing pages, and sometimes over time, e.g., long gaps between capture dates. What strategies have you developed for crawling to maximize capture and to validate the content that is captured, or do you use a passive capture approach?
Your policy is to provide "free access to the Internet collections to researchers, historians, and scholars through an account on a Unix server." Do you allow Web crawlers to access the Internet Archive? Does the Wayback Machine provide access to all of the Internet Archive content?
Does the Internet Archive support the Open Archive Metadata Harvesting
Protocol?
Digital Archives: Technical Issues
We have a series of preservation-related questions for you. First, has the Open Archival Information System (OAIS) reference model influenced the development of the Internet Archive? If not, is the Internet Archive based upon a digital archive model?
Your FAQ on storage and preservation indicates that you will migrate
your storage media but use emulation for your file formats. Did you consider
file migration? Are you exploring other options? Are you working on any
research in these areas?
Figure 1. The JFK Library Web page captured on August 15, 2000 points readers to news releases from 2002 via the "What's New" link: the frame is from the earlier date, but the news page is from the current date, because the frame uses referential linking to the current page.
Your preservation FAQ references the ARC format proposed by Alexa Internet as an Internet object archiving standard. Is Alexa trying to make this a formal standard, and if so, through which standards body?
We understand that you are trying to maintain multiple copies of the entire set of collections. Are these mirror sites? Does the amount of content present special technical issues for establishing redundant storage?
Digital Archives: Financial Issues
Your Web site states that the Internet Archive was founded to "build an 'Internet library' with the purpose of offering permanent access" to the content. Have you established a sustainable funding model for the Internet Archive? Is there a business continuity plan that will ensure ongoing access to the collections?
We noticed the monetary donation button on the Internet Archive site
via Amazon.com. Is this similar to NPR's
pledge drives? What kind of response have you gotten? Do you use those
funds for specific purposes?
Legal Issues
Can you discuss your recent brief for the Supreme Court case, Eldred v. Ashcroft? Do you anticipate the Internet Archive becoming more active in such cases?
Future Plans
How do you measure success for the Internet Archive? What are your plans? Do you have specific goals that you would like to achieve?
Automated Digitisation of Printed Material for Everyone: The METADATA ENGINE Project
Günter Mühlberger (1)
University Library Innsbruck
guenter.muehlberger@uibk.ac.at
The European Union R&D project METADATA ENGINE focuses on the digitisation of printed material such as books and journals. The project comprises 14 partners from 7 European countries and the US. Some of the libraries among our partners play leading roles in the field of digitisation, including the National Library of France and Cornell University Library (2). The project is co-ordinated by the University of Innsbruck. It started in September 2000, and will be finished by spring 2003. The main objectives of the project are to:
Why METADATA ENGINE?
The basic approach of the project is to automatically create and record as much administrative, descriptive and structural metadata (3) as possible during the conversion process. Using the METAe engine, the routine workflow will result in a full description of the digitized document. The following table gives an illustration of the metadata gathered during digitisation:
Table 1. Metadata Creation During the Conversion Process
Figure 1. Thumbnail of a screenshot of the METAe GUI
Figure 2. Cropping of Single Pages by Utilizing the Print Space of Books
Figure 3. Dynamic Binarization and Detection of Graphical Zones
Figure 4. Black Letter Fonts (12)
Currently, no OCR engine is capable of reading these characters without training, which is a major drawback for all digitisation projects in Europe. One of the leading companies in OCR technology, ABBYY Europe, is responsible for providing this missing link within the METAe project. Since OCR engines rely heavily on background dictionaries, these dictionaries will have to be supplemented with historical forms and words no longer in use. The ABBYY engine shall be available as part of the METAe engine and as a separate commercial product (13).
Segmentation and Hierarchical Ordering
The main feature of the METAe engine is its capability for automatically labelling books and journals according to their logical structure. Among the elements that can be detected are page numbers, running titles, chapter headings, titles, footnotes, margin notes, and paragraphs. Moreover, the METAe engine will extract the hierarchical structure; e.g., chapters of a book, or issues and articles within a journal. For the segmentation of the documents, the METAe engine utilizes the fact that most books exhibit internal consistency; i.e., all headlines at a certain hierarchical level are expressed with the same type and style (bold, centered, etc.). If there are sufficiently accurate results from the (physical) layout analysis, it will be possible to find similar elements, group them, and apply labels (14). This feature will be especially helpful for documents, including journals and magazines, that contain a number of single intellectual items that might be recorded individually.
Added value and benefits
Cleansed Body Text
One might ask what advantage the detailed labelling of minor elements such as headlines, footnotes or page numbers might have. In order to explain why we believe that this is one of the most innovative features of the METAe engine, we need to understand that books are composed of different functional layers. One layer is what copyright law knows as "the work", i.e., the intellectual item that exists independently from its concrete presentation. Another layer serves the need of the reader to navigate through a book. Elements of this layer include tables of contents, volume indexes, and running titles. Still another functional layer shows an advertisement that has nothing to do with the intellectual work, but might have a value on its own. In general we can say that many elements found in paper-based books are not needed any more in the electronic environment. Many elements that are helpful in books are noise from the point of view of accessing electronic text. This idea can be illustrated with the following example.
Figure 5. Page Image from Scientific Journal (1930)
Figure 6. Raw OCR Text of the Page Image of Figure 5
Figure 7. Cleansed instead of corrected OCR text
In figure 7 we see that the "real" OCR errors of this page from 1930 (scanned at 300 dpi, 8 bit grey-scale) are very rare and are no obstacle to presenting the uncorrected OCR text to the reader. In fact only one real OCR error can be found in the running text (apart from the caption line which might be corrected manually since it will also form the title of a Dublin Core record for this illustration). Obviously this cleansing process will not only be carried out on single pages but will also include noise reduction at the document level. The cleansed full-text will open up new avenues for use. It might lead to a new presentation model for digitized documents on the Internet; e.g., the cleansed full text in the front, and the page image (or parts of it) in the background. It might also lead to new products and some potential commercial benefits for libraries. For example, publishers of e-Book collections will be able to provide their users with millions of cleansed (albeit not corrected) text pages. In the rare case where a user really needs to check whether a word is correct or not he will still have the chance to access the page image on the Internet.
Book Collections as Picture Collections
Another simple but effective benefit has to be mentioned here as well. From the 1880s onwards, more and more printed documents contain pictures and halftones. In the case of such illustrated books or journals, the "Garden and Forest" collection at the Library of Congress (14) for instance, the text collection will also serve as a picture collection. The page images are kept in grey scale, their caption is labelled automatically, and so is their location within the original document. For the user this will mean that it will be possible to search within all caption lines of a collection and to retrieve just the pictures:
Figure 8. Book Collections as Picture Collections: An Example from Garden and Forest
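To make the two benefits above more concrete, here is a minimal sketch of how labelled zones might be used once the labels exist. It is not the METAe engine's data model or output format: the `Zone` class, the label names, and the sample zones are all hypothetical and invented for illustration. `cleansed_text` drops navigation elements such as running titles and page numbers, and `search_captions` searches caption lines only.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    page: int
    label: str  # hypothetical label names, e.g. "paragraph", "caption"
    text: str


# Elements that help readers navigate a paper book but are noise in full text.
NAVIGATION_LABELS = {"page_number", "running_title"}


def cleansed_text(zones):
    """Join zone text in reading order, skipping navigation-only elements."""
    return "\n".join(z.text for z in zones if z.label not in NAVIGATION_LABELS)


def search_captions(zones, query):
    """Return (page, caption) pairs whose caption contains the query."""
    needle = query.lower()
    return [
        (z.page, z.text)
        for z in zones
        if z.label == "caption" and needle in z.text.lower()
    ]


# Invented sample data, not taken from any real collection.
zones = [
    Zone(12, "running_title", "GARDEN AND FOREST"),
    Zone(12, "page_number", "12"),
    Zone(12, "paragraph", "The white oak grows slowly."),
    Zone(12, "caption", "Fig. 3. A white oak in winter."),
    Zone(13, "caption", "Fig. 4. A garden path."),
]

print(cleansed_text(zones))
print(search_captions(zones, "oak"))
```

Because each caption keeps its page number, a hit can be linked back to the page image that holds the picture.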
Digitisation as a Permanent Service
We are convinced that the METAe engine will give libraries the opportunity to create new and effective business models for digitisation. The key for this expectation is that with the METAe engine, digitisation will become much simpler than before. Input and output will be highly standardized, the vast majority of processing steps will be done automatically in the background, and the operator will be needed only for quality control and correction. Such a digitisation process will be easier to establish and libraries might be able to integrate it as a permanent service into their service portfolio. Libraries might provide digitisation on demand, or digitisation of rare books for the needs of a course or a research project.
Conclusion
The project team is convinced that the METAe engine will provide a feasible tool for in-house digitisation of library and archival collections. In order to gain experience from real world applications, the METAe engine will be installed at several METAe partner sites during the fall and winter of 2002. In the first months of 2003 a report will be released about the performance of the engine and best practise models for using it in ways that best fit the needs of libraries and their users.
Footnotes
(1) This is a summary of the work jointly carried out by the participants in the METADATA ENGINE project. I would, nevertheless, like to add special acknowledgments for the following persons: Michael Day, Alexander Egger, Paolo Frasconi, Claus Gravenhorst, Kurt Habitzel, Juha Hakala, Marco Köttstorfer, Simone Marinai, Gregor Retti, Oya Rieger, Birgit Stehno, Jupp Stöpetie, Simon Tanner, and Ralph Tiede.
(2) Partners of the project are: University Innsbruck (co-ordinator), Austria; University of Linz, Department for Applied Informatics, Austria; Mitcom (Abbyy Europe) Neue Medien GmbH, Germany; CCS Compact Computer Systeme, Germany; University Alicante, Spain; Friedrich-Ebert Foundation, Germany; Cornell University Library, Department of Preservation and Conservation, USA; Bibliothèque Nationale de France; The National Library of Norway, Rana division, Norway; Biblioteca Statale A. Baldini, Italy; Dipartimento di Sistemi e Informatica, University of Florence, Italy; University Graz Library, Austria; Scuola Normale Superiore, Centro di Ricerche Informatiche per i Beni Culturali, Italy; Higher Education Digitisation Service (HEDS), UK.
(3) Cf. Library of Congress Digital Repository Development. Core Metadata Elements.
(4) Logical and physical levels are always closely linked. A sharp separation might therefore lead to "artificial" and "peculiar" results. The team prefers to regard them as different perspectives of the same subject.
(5) The reasons for taking the METS schema are manifold. To mention just a few: Firstly, METS emerged from the MOA II white paper and has therefore not been developed from scratch but has a strong practical implementation aspect. Secondly, it has an open and flexible structure and, thirdly, it is publicly available at the Library of Congress, and it is, above all, well described.
(6) International Imaging Industry Association. DIG 35 Initiative Group.
(7) A draft document of the ALTO file is already available. After the testing and validation phase the ALTO file will be described in more detail and published on the METAe project homepage.
(8) Cf. Stehno, Birgit and Retti, Gregor: Modelling the logical structure of books and journals using augmented transition network grammars. In: Journal of Documentation. (paper will be edited in 2002).
(8) Cf. URL: http://www.4digitalbooks.com/.
(9) Benchmark for digital reproductions of monographs and serials. As endorsed by the DLF (January 25, 2002).
(10) The library might decide to store textual zones as 1 bit files in order to keep the file size low.
(11) Black letter fonts for the electronic environment are provided by: Ligaturix - der Frakturkonverter. A collection of different black letter fonts can be found at: URL: http://www.fraktur.com/.
(12) Cf. a METAe project paper on black letter fonts: URL: http://heds.herts.ac.uk/METAe/Articles/art04_2.htm
(13) The natural limit of the automated process has to be mentioned here once more: If there are intellectual structures in a work which do not have a recognisable representation in the layout, the engine will not be able to recognise them automatically.
(14) Garden and Forest: A Journal of Horticulture, Landscape Art, and Forestry (1888-1897). A joint project of the Library of Congress Preservation Reformatting Division, the University of Michigan Making of America project, and the Arnold Arboretum of Harvard University.
Researching Long Term Digital Preservation Approaches in the Dutch Digital Preservation Testbed (Testbed Digitale Bewaring)
Maureen Potter
Digital Preservation Testbed, Netherlands
Maureen.Potter@ictu.nl
In 1996, the Netherlands Ministry of the Interior and the Ministry of Education, Culture and Sciences initiated a collaborative programme entitled Digital Longevity (Digitale Duurzaamheid). This programme, run in conjunction with the National Archives, sponsored Jeff Rothenberg's 1999 publication, Carrying Authentic, Understandable and Usable Records Through Time, which proposed establishing a testbed to carry out research into possible approaches for the long term digital preservation of archival records (1). The Digital Preservation Testbed (Testbed Digitale Bewaring) was born the following year.
This article introduces the work of the Digital Preservation Testbed.
It first places the Testbed in context within the rest of the Digital Longevity programme and defines the scope and goals of the project. Our objectives and research questions are identified, followed by a review of the rigorous scientific approach that the Testbed takes in its experiments. The benefits of this are highlighted, as is the practical nature of the Testbed. Finally, the products and deliverables that are expected to emerge throughout the course of the project are discussed and identified.
Background and Scope
The Digital Preservation Testbed is part of a wider network of initiatives that the Dutch government has established to deal with the challenges posed by the electronic era. The Testbed belongs to the Digitale Duurzaamheid Programme, whose overall aim is to guarantee the accessibility of information held by the government in digital form (2). Three other projects complete the Digitale Duurzaamheid programme: the RecordKeeping System (RKS) project, establishing guidelines and providing advice to Dutch Ministries on the selection of an RKS; the Kwaliteitzorg, concerned with ensuring the quality of the records being produced electronically; and the Taskforce Digitale Duurzaamheid, whose main aim is to raise awareness of the digital longevity issues throughout government.
The goal of the Testbed within the Digitale Duurzaamheid programme is to help achieve the lasting accessibility of government information in digital form. The Testbed will provide advice that is tailored to the situation here in the Netherlands. Our focus is on the preservation of electronic records for the long term, and our strategy begins with preparing for the preservation of records from their point of creation. Our intention is to ensure the reliable creation and management of electronic records so that they are in a suitable state for long-term preservation action.
The Testbed is running controlled experiments to explore options for long-term preservation approaches and the advice on these will be issued to the Dutch government later this year. Our research is initially limited to four main alphanumeric record types: text documents, email messages, spreadsheets and databases, all of which are widely used within ministries and government organisations. Three preservation approaches are under consideration: migration, emulation, and XML, which are discussed in more detail below. Four record types and three approaches result in 12 possible combinations. This initial set concentrates our resources and limits what is otherwise an exponential and unstructured research area. Also, not every record type is suitable for every preservation approach. For example, we do not consider it to be worthwhile to attempt emulation for emails. Email packages rely upon standard exchange formats that enable email systems to be interoperable. The sender and receiver will often perceive the look and feel attributes of a message differently. The question then becomes: "what exactly am I trying to preserve?" You need not preserve something that was not present in the first instance. Emulation is thus not the best match for a preservation approach to this record type.
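The arithmetic behind the 12 combinations can be spelled out in a few lines. This is only an illustrative sketch: the short names are paraphrased from the paragraph above, and the exclusion set holds the single pairing the text describes as not worthwhile (emulation for emails).

```python
from itertools import product

RECORD_TYPES = ["text document", "email message", "spreadsheet", "database"]
APPROACHES = ["migration", "emulation", "XML"]

# Pairing the text describes as not worth attempting.
EXCLUDED = {("email message", "emulation")}

# Every (record type, approach) pair: 4 x 3 = 12.
all_pairs = list(product(RECORD_TYPES, APPROACHES))


def candidate_combinations():
    """Return all pairs except those ruled out as unsuitable."""
    return [pair for pair in all_pairs if pair not in EXCLUDED]


print(len(all_pairs))                 # 12
print(len(candidate_combinations()))  # 11
```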
The Testbed Research Framework translates these objectives into a clear set of research questions that are refined and updated throughout the duration of the project. These range from fundamental research questions that require comparing the results of large groups of experiments, to questions focusing on the role and significance of record features, attributes, and metadata that may be answered by individual and smaller groups of experiments. Fundamental Research Questions include:
These questions can be considered in light of cost, record type, authenticity requirements, and supporting resources, to name but a few. The subset of Attribute research questions includes:
These questions consider attributes in terms of record type, software, preservation approach, metadata implementation, and preservation function implementation. Defining essential preservation metadata is also a priority of the project, and is included in other research questions. Preservation Approaches As the Testbed Research Framework notes, record keepers have not yet defined an explicit and definitive methodology for any of the preservation approaches we are considering. The Testbed is contributing to the delineation of various appropriate methodologies. As discussed elsewhere, there are various ways to implement a migration strategy (3). The same can be said for emulation and the vaguely defined "XML approach." Indeed, there are so many different ways to formulate a strategy that the boundaries between them can become indistinct (4). Let's begin by defining the various approaches: migration, emulation, and the XML approach (5). The Testbed definition of migration is a relatively simple one: the transfer of records from one hardware and software configuration to another. This includes migration through and over generations of application versions, as well as across applications and operating systems, in practice often on either proprietary or part-proprietary software, but excludes media refreshing. Closely related to this is the XML approach. Conversion to XML (or conversion to standards in general) can be considered to be a form of migration. However, XML is not tied to any particular software system and is often regarded as the most promising present day format for archiving and interoperability. It has a multiplicity of uses, and so deserves to be considered as an approach in its own right. For the emulation strand of our experiments we will be working with IBM and will perform experiments with their UVC (Universal Virtual Computer) on archival records at the Testbed. The UVC approach is described in an earlier RLG Diginews article. 
The UVC is a variation of the emulation approach that addresses the problem of future interpretation of data files by writing a program to carry out the interpretation in the language of a Universal Virtual Computer. This strategy can be extended to archiving the program as well, making it more like the full emulation approach identified and proposed by Rothenberg (6).
Methodology
The Testbed is a practical project that runs controlled experiments in a secure environment to study potential preservation approaches, and to ascertain the effects of preservation actions on different file formats and record types. In order to provide valid and reliable results, the Testbed has established a rigorous experiment process that clearly articulates the requirements for each experiment. An experiment process is defined around a specific preservation approach and record type, e.g., migration of text documents, or conversion of emails to XML. Each process consists of 12 documented stages. Once a process has been defined and run for the first time, further experiments can be run on that process using its generic requirements and procedures.
Figure 1. Flow chart of Testbed Project Experiment Process
For example: Experiment process 1 caters to the migration of text documents. The first five stages in the process are general stages concerned with the broad requirements of the preservation approach (migration) and record type (text documents). These stages define the exploration area, identify relevant background literature, specify authenticity requirements and evaluation criteria, develop an overall experiment process design, and estimate the required resource specifications. The remaining stages are then specific to an individual experiment; that is to say, they specify the records to be used in that experiment (e.g., Microsoft Word 95) and the specifics of the approach (e.g., migration to PDF).
Further experiments can be run using the generic documents from the first five stages of the Process. Several further experiments can therefore be run using one experiment process. These may examine different formats of the record type (e.g., Word Perfect or Microsoft Word 2000 as text documents) or they may examine metadata issues. There are several advantages to this approach. The controlled, well-documented experiment process allows each experiment to be easily reconstructed in the future and allows experiments to be re-run to confirm or check results, if necessary. The process lays out requirements to be considered at each stage, from preservation requirements to functional requirements, metadata requirements to authenticity requirements. The iterative nature of the process allows us to produce a base set of documents for each set of related experiments, thus eliminating any duplication. Grouping experiments in this way also helps us to consider the combined results of sets of experiments and the overall "success" of each preservation approach. The secure and controlled Testbed environment, in which all of the experiments are carried out, ensures that our experiment results are valid and free from errors from extraneous sources. We carry out regular control and null experiments to ensure this remains consistent. The practical nature of the project brings other benefits as well. The experiment process covers all of the key points in the history of the record, from creation (by way of model records (7)) to capture, appraisal, and long-term storage. These aspects can have unexpected effects on the implementation and success of a preservation strategy. The Testbed considers all of these aspects within the scope of the experiment process, and they have yielded useful and interesting results. Preliminary Results (8) Our first experiments concentrated on the migration of text documents. 
This combination was chosen for the first experiments as the use of text documents is widespread throughout government and many organisations already carry out routine migrations when updating their computer systems. Our main goal in the early experiments was to examine and identify record features that changed as a result of the migration process. Microsoft Word was identified as a good starting point. It is one of the most widely used word processing applications available and is used by many agencies to produce government records. Experiments have taken place on the migration of text documents through and over generations of Microsoft Word using both model and test records. Migration through generations refers to migrating through successive versions of an application, e.g., from Word 95 to Word 97 then to 2000 and then to 2002. Migration over generations skips the intermediate versions and goes straight to the current highest version, e.g., from Word 95 to Word 2002. We have experimented with migration through and over generations of Word 95, Word 97, Word 2000 and Word 2002. We have also experimented with migration of Word files to PDF 1.2, 1.3, and 1.4. Results of the model record Word experiments showed that if the record had been created well initially, it stood a far better chance of retaining its features through and after migrations. Fields that update automatically (e.g., date fields) and that were not fixed after document creation wound up being updated every time the document was accessed, thus altering an essential content and context-reference feature. This is a problem whilst the records are still in active use, let alone when they reach the archives. However, most features migrated successfully. 
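The difference between the two migration strategies defined above can be sketched as follows. This is only an illustration of the definitions, not Testbed software; the function name and the `strategy` values "through" and "over" are hypothetical, and no actual file conversion takes place.

```python
WORD_VERSIONS = ["Word 95", "Word 97", "Word 2000", "Word 2002"]


def migration_steps(versions, source, target, strategy):
    """Return the (from, to) conversions implied by one migration strategy.

    "through" visits every intermediate version in order;
    "over" converts directly from source to target.
    """
    start, end = versions.index(source), versions.index(target)
    if start >= end:
        raise ValueError("target must be a later version than source")
    if strategy == "over":
        return [(source, target)]
    if strategy == "through":
        chain = versions[start:end + 1]
        return list(zip(chain, chain[1:]))
    raise ValueError(f"unknown strategy: {strategy}")


through = migration_steps(WORD_VERSIONS, "Word 95", "Word 2002", "through")
over = migration_steps(WORD_VERSIONS, "Word 95", "Word 2002", "over")
print(through)  # three conversions, one per successive version
print(over)     # a single direct conversion
```

Migrating through generations from Word 95 to Word 2002 takes three conversions, while migrating over generations takes one, so each record passes through fewer conversion steps.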
The position of the text on the page was sometimes different, but colour, paragraph and font formatting, bullets and numbering, inserted and well-formed tables, hyperlinks, pictures and diagrams were all successfully retained in the experiments we have carried out so far (9). Use of records donated by government ministries took our investigations to the next level. We had not been involved in the record creation, so we could not be sure of how the features had been formed. One record, which at face value looked like a well-formed table, turned out to be composed of floating text boxes. Other records from different participants had designated "protected sections" in which automated fields had been used and then "fixed" in place. These sections included such essential metadata items as date, author, recipients, and a unique reference. Yet other records were composed on different computers with different settings, or included text that had been cut and pasted from a different application altogether. These "cut and paste" sections are affected differently than the rest of the document during and after a migration and can result in a change in the appearance of the record without adversely affecting its content. These experiences allow us to examine unexpected record features and to assess more closely the ways in which records are created. As a result, we can formulate advice on the creation of records and use of record-creating applications, putting preservation concerns in place from the beginning of the records continuum. The set of Microsoft Word experiments showed that generally, migration over generations was at least as reliable, and in some cases more reliable, than migration through generations. This may counter some of the scepticism about the costs of migration. Sceptics have argued that the recurring costs of migration will be too great to bear.
Our results so far have shown that migration to each new version of an application should not be necessary, and we hope that experiments with other word processing applications will allow us to extend this hypothesis. The archival regulations of the Netherlands state that ministries are responsible for the authentic retention of their own records for the first 20 years, after which time a percentage of them are sent to the National Archives for long term archiving. The rest can be disposed of according to disposition regulations. It may be the case that Ministries can simply retain the documents in their original format, with maybe one or two controlled migrations, until the 20 years have passed. The Archives can then undertake more suitable long-term action concerning the current and future formats of the record. This is simply one possibility that we are considering. We still have many more experiments planned, and are waiting until all of the results are in before we release full advice on the long-term preservation of records produced by government agencies. The combined results will allow us to weigh alternative approaches, in conjunction with metadata and authenticity requirements, and determine the best way to implement these approaches. There are many different ways to carry out the same task and it is unlikely that a "one size fits all" approach will be suitable for different record types with different retention requirements.
Products, Tools, and Deliverables
In addition to the research results, the Digital Preservation Testbed will also develop a more concrete set of tools and products. These include the Testbed Research Database, which supports our experiments and acts as a valuable source of knowledge on many aspects of digital preservation. This database is being built over the course of the project and aims to collect and provide commentary on relevant digital preservation literature.
The contents are not limited to publications, but also include public listserv messages, Web sites, presentations, and Testbed references. Wherever possible, we have gathered electronic copies of the research documents and stored them in the Testbed system for reference by the project team. These are supplemented by online resources (for which the URLs are checked on a regular basis) and printouts. The research database is easily searchable and will be a valuable record of the project, as well as a useful resource. An abridged version for online resources only is available on our Web site. Other deliverables include white papers on each of our preservation strategies. The white paper on migration was published in December 2001, and the XML for preservation paper is scheduled for publication in late summer 2002. These white papers aim to provide a synopsis of current knowledge about each preservation approach, and to delineate ways in which the approach can be implemented. We will also deliver a technical report on the Testbed system itself, for which an extensive set of functional requirements is being developed. The Testbed Newsletter is produced quarterly and the Web site is updated monthly. Thus far, the Testbed has completed groups of experiments on emails and text documents and is issuing preliminary advice on the short and long-term preservation of emails. Work will soon commence on spreadsheets and databases, for which advice will also be released. The Testbed project is due to finish in October 2003 but preservation research is unlikely to stop there. The Digital Longevity Program will continue to run and coordinate digital recordkeeping and archival efforts for the Government of the Netherlands. Footnotes (1) Jeff Rothenberg & Tora Bikson: Digital Preservation. Carrying Authentic, Understandable and Usable Records Through Time. 
(The Hague, 1999) (back) (2) The Testbed and the Taskforce are also part of the ICTU, a non-profit organisation established by the Dutch government to house their e-government projects, including PKI (Public Key Infrastructure) and advies overheid (monitoring and advising on government Web sites at every level). See http://www.ictu.nl for further details. The close proximity of these projects allows them to easily collaborate and share information. (back) (3) Testbed Digitale Bewaring white paper, Migration: Context and Current Status (The Hague, 2001). (back) (4) See for example Kees van der Meer et al., Emulation and Conversion: Organisational and Architectural Overview of an Electronic Archive (Technical University of Delft and Utrecht University, 2001). This can also be seen by comparing current literature from around the globe dealing with preservation strategies and enforcement. (back) (5) See the Testbed White Paper on Migration (op cit) for a more extensive discussion on each of these approaches. (back) (6) Rothenberg & Bikson, op cit. See also Jeff Rothenberg: An Experiment in Using Emulation to Preserve Digital Publications, NEDLIB Report Series (The Hague: NEDLIB consortium, 2000) and Avoiding Technological Quicksand: Finding A Viable Technical Foundation for Digital Preservation: A Report to the Council on Library and Information Resources. (Washington, D.C.: Council on Library and Information Resources, 1999). (back) (7) The Testbed uses two sorts of records in its experiments. The first are model records, which are created in the Testbed to examine and evaluate the effects of preservation action on specific record features (e.g., user-defined and automatic fields, templates, font and paragraph formatting, and signatures). The advantage to using our own records for this purpose is that we know exactly which features are present in each record and where. This allows us to carry out highly focussed experiments on record attributes. 
It also serves as a good starting point for any new round of experiments. The second sort of records includes test records. These are obtained from ministries and other government organisations, and are used in the larger-scale experiments that address the fundamental research questions and authenticity requirements. (back) (8) This section is intended to give the reader a flavour of the types of results we have gathered so far. Future reports from the Testbed will discuss our results to a greater extent than this introductory article. (back) (9) The exact position of the text on the page can be affected by as small a thing as changing the printer or printer driver. This is not a change that has affected the authenticity of the records in any of our experiments to date, but it has resulted in documents containing several more pages than they did originally, especially if page breaks have been employed. (back) Highlighted Web Site
Where are they now? Digitizing Microfilmed Newspapers We often read about new projects and programs in RLG DigiNews, but what about past efforts? What results have been produced in the five years since RLG DigiNews began publishing? Introduction In this issue, we continue a feature started in our five-year anniversary issue (April 2002) and take another look at projects originally reported on during our first year of publication. In August 1997, an article in RLG DigiNews by Alan Howell discussed three International Newspaper Film Scanning Projects. The projects profiled were the scanning of the Burney Collection at the British Library, the Caribbean Newspaper Imaging Project of Cuban and Haitian newspapers at the University of Florida, and the scanning of mid-19th century periodicals and newspapers for the Australian Cooperative Digitization Project (ACDP). As source material for digitization, newspapers are amongst the most challenging. Large page size, tiny type, halftone images, haphazard layout and poor bibliographic control are commonplace. Newspapers scanned from microfilm introduce additional difficulty in terms of image quality, processing, and sometimes cost. These factors complicate capture, delivery and access control and lead to a variety of technical and management obstacles. Given the technological advances that have occurred in the past five years, we thought it would be instructive to revisit these projects to learn where they are today, what lessons have been learned, and how the state-of-the-art has changed. We were able to obtain updates on two of the three original projects, the Burney Collection scanning initiative and the Australian Cooperative Digitization Project. The Caribbean Newspaper Imaging Project (CNIP) is in the midst of a "technological renovation" (migrating from CD-ROM to Web) and was unable to meet our publication deadline. However, project reports from two earlier phases are available on CNIP's Web site (1). 
Scanning the Burney Newspaper Collection at the British Library Contact person: John Goldfinch, Early Printed Collections, The British Library The Burney collection consists of 700 volumes of 17th-, 18th-, and 19th-century newspapers. It is especially prized for its coverage of 18th-century London newspapers, including many unique items. Owing to its age and condition, the originals are not available for study, and the popular collection is currently viewable only on microfilm. The original digitization work on the Burney collection was actually a British Library (BL) experiment that started in 1992 with an effort to determine the utility of its Mekel microfilm scanner in digitizing the library's microfilm holdings. From 1993 to 1996, BL attempted to learn what digital technology could do for it, as part of its Initiatives for Access program. The Burney collection was chosen for testing in part because of the challenges it presented. The original documents varied substantially in density, both across and within pages, and much of the type is broken, leading to film described by Hazel Podmore in BL's write-up of its experiment as "not the best-quality the Library has ever produced (2)." As expected, digitizing the Burney film proved difficult. According to John Goldfinch, the principal difficulties stemmed from the "expense and availability of high capacity storage for the files being created, together with the level of manual intervention required to deal with such things as deskewing the images, and coping with the highly variable print quality of the originals and deficiencies in the film." In fact, Podmore noted that storage considerations led to most scanning being carried out at a less-than-optimal 200 dpi. Another technical obstacle, considered insurmountable at the time, was the inability of any available Optical Character Recognition (OCR) package to produce acceptable machine-readable text for searching and indexing. 
At the time, the difficulties were so significant that the British Library concluded that it could not justify continuing digitization of the Burney collection. BL feels that much has changed in the intervening years. Storage costs have dropped by about two orders of magnitude (about 100 times). John Goldfinch asserts that "recently demonstrated developments in OCR technology offer the exciting prospect that OCR is at last able to cope with the difficulties of early type." BL and other institutions have gained considerable experience in the scanning of microfilm of early printed material. As a consequence, BL is revisiting the question of the Burney newspapers, and has received a grant from the National Science Foundation to begin creating a fully searchable on-line library of British 18th-century newspapers. As it had originally hoped to do in the mid-1990s, BL now plans to produce a complete set of images along with an index of titles with issue dates and numbers, and to make the complete collection freely available to researchers over the Web. A precise release date is not yet available. Australian Cooperative Digitization Project, aka the Ferguson Project Contact person: Ross Coleman, Collections Coordinator, University of Sydney Library The Australian Cooperative Digitization Project is a collaboration involving the University of Sydney Library, the State Library of New South Wales, the National Library of Australia and Monash University Library, amongst others. Ross Coleman and Colin Webb summed up the project's purpose as the "enhance[ment of] literary and historical research on nineteenth-century Australia by providing improved access to, and preservation of, scarce primary material confined to a few major library collections (3)." The material selected for digitization, based on Ferguson's Bibliography of Australia, is confined to a critical six-year period in Australian history (1840-45). 
Though preceding the Australian gold rush, these materials represent a historical gold mine and provide a defining record of a distinct Australian colonial culture. ACDP was carried out as a true hybrid project, ensuring that preservation-quality microfilm existed or would be created for every title, and that stringent quality control guidelines governed the creation of digital images from the film. Although conceived as an experiment with a strong mandate to develop policies and procedures that could be applied to other Australian digitization efforts, ACDP also carried significant production expectations. Sixty-seven periodical titles (including newspapers) and four novels were ultimately digitized. In providing a retrospective to RLG DigiNews, as well as in previous summaries of the project (see http://www.nla.gov.au/ferg/about/ for several references), Ross Coleman and his colleagues have been unusually candid about the obstacles they encountered in bringing ACDP to fruition. Particularly noteworthy, and of value to anyone contemplating a digitization program of any size, is the excellent discussion in Webb and Coleman of the tug-of-war between workflow-enhancing automation procedures and the need to maintain sufficient quality for text legibility and successful processing via OCR (4). For example, most of the existing film was not of sufficient quality for digitization, and even the refilmed material presented some barriers to fully automated digital capture. As is so often the case in library and archive projects, quality requirements ultimately ruled, but a heavy price was paid in terms of missed deadlines, tense vendor relations, loss of staffing continuity, and frayed nerves.