FAQ

Where are they now? Digitizing Microfilmed Newspapers

We often read about new projects and programs in RLG DigiNews, but what about past efforts? What results have been produced in the five years since RLG DigiNews began publishing?

Introduction

In this issue, we continue a feature started in our five-year anniversary issue (April 2002) and take another look at projects originally reported on during our first year of publication. In August 1997, an article in RLG DigiNews by Alan Howell discussed three International Newspaper Film Scanning Projects. The projects profiled were the scanning of the Burney Collection at the British Library, the Caribbean Newspaper Imaging Project of Cuban and Haitian newspapers at the University of Florida, and the scanning of mid-19th century periodicals and newspapers for the Australian Cooperative Digitization Project (ACDP).

As source material for digitization, newspapers are amongst the most challenging. Large page size, tiny type, halftone images, haphazard layout and poor bibliographic control are commonplace. Newspapers scanned from microfilm introduce additional difficulty in terms of image quality, processing, and sometimes cost. These factors complicate capture, delivery and access control and lead to a variety of technical and management obstacles.

Given the technological advances that have occurred in the past five years, we thought it would be instructive to revisit these projects to learn where they are today, what lessons have been learned, and how the state-of-the-art has changed. We were able to obtain updates on two of the three original projects, the Burney Collection scanning initiative and the Australian Cooperative Digitization Project. The Caribbean Newspaper Imaging Project (CNIP) is in the midst of a "technological renovation" (migrating from CD-ROM to Web) and was unable to meet our publication deadline. However, project reports from two earlier phases are available on CNIP's Web site (1).

Scanning the Burney Newspaper Collection at the British Library
Contact person: John Goldfinch, Early Printed Collections, The British Library

The Burney collection consists of 700 volumes of 17th-, 18th-, and 19th-century newspapers. It is especially prized for its coverage of 18th-century London newspapers, including many unique items. Owing to its age and condition, the originals are not available for study, and the popular collection is currently viewable only on microfilm.

The original digitization work on the Burney collection was actually a British Library (BL) experiment that started in 1992 with an effort to determine the utility of its Mekel microfilm scanner in digitizing the library's microfilm holdings. From 1993 to 1996, BL attempted to learn what digital technology could do for it, as part of its "Initiatives for Access" program. The Burney collection was chosen for testing in part because of the challenges it presented. The original documents varied substantially in density, both across and within pages, and much of the type is broken, leading to film described by Hazel Podmore in BL's write-up of its experiment as "not the best-quality the Library has ever produced" (2).

As expected, digitizing the Burney film proved difficult. According to John Goldfinch, the principal difficulties stemmed from the "expense and availability of high capacity storage for the files being created, together with the level of manual intervention required to deal with such things as deskewing the images, and coping with the highly variable print quality of the originals and deficiencies in the film."


In fact, Podmore noted that storage considerations led to most scanning being carried out at a less-than-optimal 200 dpi. Another technical obstacle, considered insurmountable at the time, was the inability of any available Optical Character Recognition (OCR) package to produce acceptable machine-readable text for searching and indexing. At the time, the difficulties were so significant the British Library concluded that it could not justify continuing digitization of the Burney collection.
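The link between scanning resolution and storage can be made concrete with some simple arithmetic. The sketch below is illustrative only: the page dimensions and the 8-bit grayscale depth are hypothetical assumptions, not figures reported by the Burney project, and the sizes are for uncompressed images.

```python
def uncompressed_megabytes(width_in, height_in, dpi, bits_per_pixel):
    """Return the uncompressed size of one scanned page in megabytes (10**6 bytes)."""
    pixels = (width_in * dpi) * (height_in * dpi)
    return pixels * bits_per_pixel / 8 / 1_000_000


# Hypothetical page dimensions in inches; NOT taken from the Burney project.
PAGE_WIDTH_IN = 12
PAGE_HEIGHT_IN = 18

for dpi in (200, 300, 400):
    # Assumes 8-bit grayscale capture; bitonal (1-bit) scans would be one-eighth the size.
    size = uncompressed_megabytes(PAGE_WIDTH_IN, PAGE_HEIGHT_IN, dpi, bits_per_pixel=8)
    print(f"{dpi} dpi: {size:.1f} MB per page")
```

Because pixel count grows with the square of the resolution, doubling from 200 to 400 dpi quadruples the storage needed for every page, whatever the actual page size.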

BL feels that much has changed in the intervening years. Storage costs have dropped by about two orders of magnitude (about 100 times). John Goldfinch asserts that "recently demonstrated developments in OCR technology offer the exciting prospect that OCR is at last able to cope with the difficulties of early type." BL and other institutions have gained considerable experience in the scanning of microfilm of early printed material.

As a consequence, BL is revisiting the question of the Burney newspapers, and has received a grant from the National Science Foundation to begin creating a fully searchable on-line library of British 18th century newspapers. As it had originally hoped to do in the mid-1990s, BL now plans to produce a complete set of images along with an index of titles with issue dates and numbers, and to make the complete collection freely available to researchers over the Web. A precise release date is not yet available.

Australian Cooperative Digitization Project, aka the Ferguson Project
Contact person: Ross Coleman, Collections Coordinator, University of Sydney Library

The Australian Cooperative Digitization Project is a collaboration involving the University of Sydney Library, the State Library of New South Wales, the National Library of Australia and Monash University Library, amongst others. Ross Coleman and Colin Webb summed up the project's purpose as the "enhance[ment of] literary and historical research on nineteenth-century Australia by providing improved access to, and preservation of, scarce primary material confined to a few major library collections" (3). The material selected for digitization, based on Ferguson's Bibliography of Australia, is confined to a critical six-year period in Australian history (1840-45). Though preceding the Australian gold rush, these materials represent a historical gold mine and provide a defining record of a distinct Australian colonial culture.

ACDP was carried out as a true hybrid project, ensuring that preservation-quality microfilm existed or would be created for every title, and that stringent quality control guidelines governed the creation of digital images from the film. Although conceived as an experiment with a strong mandate to develop policies and procedures that could be applied to other Australian digitization efforts, ACDP also carried significant production expectations. Sixty-seven periodical titles (including newspapers) and four novels were ultimately digitized.

In providing a retrospective to RLG DigiNews, as well as in previous summaries of the project (see http://www.nla.gov.au/ferg/about/ for several references), Ross Coleman and his colleagues have been unusually candid about the obstacles they encountered in bringing ACDP to fruition. Particularly noteworthy, and of value to anyone contemplating a digitization program of any size, is the excellent discussion in Webb and Coleman of the tug-of-war between workflow-enhancing automation procedures and the need to maintain sufficient quality for text legibility and successful processing via OCR (4).

For example, most of the existing film was not of sufficient quality for digitization, and even the refilmed material presented some barriers to fully automated digital capture. As is so often the case in library and archive projects, quality requirements ultimately ruled, but a heavy price was paid in terms of missed deadlines, tense vendor relations, loss of staffing continuity, and frayed nerves.


As might be expected, newspapers proved particularly troubling, especially because of their size.   However, Ross Coleman identified additional obstacles stemming from "the variety within any one title, from foxing and discoloration, to the use of varying fonts and point sizes on the one page."   Coleman also acknowledges that newspapers require quality OCR in order to truly justify their digitization, since they generally lack even rudimentary indexing.  Unfortunately, marginal print quality and type size variation often thwart the creation of an accurate body of searchable text, even with current technology.   (For example, ProQuest Historical Newspapers™ reports 80-90% OCR accuracy for the article text from its New York Times microfilm.)
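To see why accuracy in this range limits full-text searching, consider a rough model. The sketch below assumes the accuracy figure applies per character and that errors are independent of one another; the source does not say how the ProQuest figure is measured, so both assumptions are hypothetical and the results are indicative only.

```python
def word_accuracy(char_accuracy, word_length):
    """Probability that every character in a word is recognized correctly,
    assuming independent per-character errors (a simplifying assumption)."""
    return char_accuracy ** word_length


# Illustrative accuracy levels; word_length=6 is an arbitrary example length.
for p in (0.80, 0.90, 0.99):
    share = word_accuracy(p, word_length=6)
    print(f"character accuracy {p:.0%}: about {share:.0%} of six-letter words fully correct")
```

Under these assumptions, a search for a six-letter term would miss a large share of its true occurrences at 80-90% character accuracy, which is why marginal OCR undermines the case for digitizing unindexed newspapers.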

Despite having successfully completed its initial objectives, ACDP exceeded its anticipated resource consumption to such a degree that conversion of additional 19th century periodical titles has been put on hold.   Within the selection of periodicals, newspapers continue to be viewed as especially daunting targets for digital capture.   Coleman reports "the fact that no more have been done, or even contemplated, highlights the fact that—at the time—we were not confident in the technology, or our procedures, or in the effectiveness in delivering such things over the Web in a usable manner."

So while ACDP has succeeded in greatly expanding access to a corpus documenting an important slice of Australian history, it has not, as yet, provided the basis for expanded conversion of other materials from that period.

Conclusion

In revisiting these two projects, we encountered somewhat different perspectives about the current viability of digitizing, OCRing, and providing Web access to microfilmed newspapers.   One possible explanation for the differing opinions is the timing of the initiatives. ACDP started out as the Burney experiment (along with many other early digitization experiments) was wrapping up.   Burney was conceived as more of an experiment, and was carried out at a time when the technology was clearly not up to the task.   It was shelved until very recently, when the technology seemed like it might finally be able to tackle the challenge.

On the other hand, ACDP was conceived as a production enterprise, and was carried to completion despite knocking against technological barriers at several points. Having only recently completed the mounting of files, ACDP is still hesitant about taking on additional conversion, given the technological obstacles it encountered.

It is noteworthy, however, that what most distinguishes Ross Coleman's perspective from that of John Goldfinch has little to do with the technological underpinnings of the respective projects. Although both speak to the frustrations of digitizing challenging older materials, the most striking difference is in Coleman's emphasis on the obstacles created by management issues. Problems faced in the management arena remain underreported and under-discussed within digital imaging circles, compared to those in the technical realm. Even as some (though by no means all) of the technological barriers to effective large-scale digitization of older printed materials begin to fall, we would be wise not to downplay the ongoing challenges represented by funding, staffing, vendor relations, planning, and the like.

Perhaps the ultimate lesson from the experiences described above is that there is still no such thing as a large-scale, cookie-cutter digitization project.   Despite many successfully completed efforts and improved availability of training and documentation, the work remains technically complex, time-consuming, and expensive. Working from marginal source materials introduces additional complexities, and newspapers continue to push the limits of current digital capture, image processing, OCR, and Web delivery technologies.

Further reading

In addition to the references already given, here are some useful readings on recent newspaper digitization efforts:

The ProQuest Historical Newspapers™ project (backfiles of the Christian Science Monitor, the Wall Street Journal, the New York Times, the Washington Post and Canadian newspapers digitized by Cold North Wind ("practically every newspaper published in Canada from 1750 to 1950") with plans to add other national, regional and local publications).   The home page provides links to a slide show about the project.   An additional demo is also available.

OCLC Digital & Preservation Resources and Olivesoft digitization of historic newspaper collections (an initiative "to help libraries provide full online searchable access to their historic newspapers").   Read the press release for this collaboration and read about Olivesoft's ActivePaper Archive™ software.

The Nordic Digital Newspaper Library (Nordic Newspapers from 1640-1860). Read a paper by Majlis Bremer-Laamanen presented at the 2001 Annual Meeting of the United States Newspaper Program held at the Library of Congress in Washington, DC on April 26th 2001.

Digitisation of Newspaper Clippings: The LAURIN Project by Günter Mühlberger.   RLG DigiNews, v. 3, no. 6, December 15, 1999.

--RE


Footnotes
(1) Erich Kesse, Robert Harrell, Richard Phillips and Cecilia Botero, Caribbean Newspaper Imaging Project, Phase I: Imaging and Indexing Model and Phase II: OCR Gateway to Indexing.
(2) Hazel Podmore, “The Digitisation of Microfilm” in L. Carpenter, S. Shaw and A. Prescott, eds., Towards the Digital Library (London, 1998).
(3) Colin Webb and Ross Coleman, Digital conversion of Nineteenth century publications—Production management in the Australian Cooperative Digitisation Project 1840-45. LASIE, v. 31 no. 2, June 2000, pp.5-20. Also available in HTML.
(4) Ibid.

 

Publishing Information

RLG DigiNews (ISSN 1093-5371) is a newsletter conceived by the members of the Research Libraries Group's PRESERV community. Funded in part by the Council on Library and Information Resources (CLIR) 1998-2000, it is available internationally via the RLG PRESERV Web site (http://www.rlg.org/preserv/). It will be published six times in 2001. Materials contained in RLG DigiNews are subject to copyright and other proprietary rights. Permission is hereby given for the material in RLG DigiNews to be used for research purposes or private study. RLG asks that you observe the following conditions: Please cite the individual author and RLG DigiNews (please cite URL of the article) when using the material; please contact Jennifer Hartzell, RLG Corporate Communications, when citing RLG DigiNews.


Any use other than for research or private study of these materials requires prior written authorization from RLG, Inc. and/or the author of the article.


RLG DigiNews is produced for the Research Libraries Group, Inc. (RLG) by the staff of the Department of Preservation and Conservation, Cornell University Library. Co-Editors, Anne R. Kenney and Nancy Y. McGovern; Production Editor, Barbara Berger Eden; Associate Editor, Robin Dale (RLG); Technical Researchers, Richard Entlich and Peter Botticelli; Technical Coordinator, Carla DeMello.


All links in this issue were confirmed accurate as of February 14, 2002.


Please send your comments and questions to preservation@cornell.edu.