August 15, 2002, Volume 6, Number 4
ISSN 1093-5371
Digitizing Historic Newspapers: Progress and Prospects

Marilyn Deegan
Emil Steinvel, Olive Software
Edmund King, British Library

In the last issue of DigiNews, Richard Entlich (1) presented a fascinating update of some projects to digitize newspaper content from microfilm, which had last been reported upon some five years previously (2). Entlich concluded the piece by stating that "newspapers continue to push the limits of current digital capture, image processing, OCR, and Web delivery technologies." One of the projects originally surveyed in 1997, the Caribbean Newspaper Imaging Project of Cuban and Haitian newspapers at the University of Florida, was unable to provide updated information for the Entlich piece, but in a comprehensive and extremely useful report on the digitization of newspaper content from microfilm made available in 2001, the project concluded that, for such materials, "there is still no good, cost effective means of providing the researcher with full text or connecting story lines broken by column and page breaks."
Although it is true that newspaper content is extremely challenging for many different reasons, the cost-effective creation of usable and searchable digital content that offers users a realistic experience of the richness of newspapers is perhaps closer than we have hitherto thought. A number of libraries and academic institutions, together with Olive Software and OCLC, have made significant volumes of newspaper content available for full-text searching over the Internet, using automated processes developed by Olive Software.

The Importance of Newspapers

The desire to be informed, and to be informed speedily, about local and remote happenings seems to be a basic human need. Methods of disseminating news rapidly have been developed at all periods and in all cultures, both literate and oral. In 490 BC, legend has it, Phidippides, the first ever Marathon runner, completed his 26-mile run from Marathon to Athens in around 3 hours to announce victory over the Persians and promptly died from exhaustion. News passed through oral transmission is only as durable as the memories of those who transmit or hear it, but news recorded on some kind of medium has a life beyond the immediate purpose of imparting information rapidly: it becomes part of the historic record. And there is no other medium in our history that records every aspect of human life over the last 300 years, on a daily basis, like newspapers. The information contained in newspapers is, however, considered by its creators as essentially ephemeral (important today, discarded tomorrow), and so they print it on paper which is produced with cheapness in mind, rather than survival. As newspapers have developed over the last three to four centuries they have become increasingly complex.
The desire to be informed creates a huge market for those who want to inform, so, as well as news, newspapers now also have pictures, comments, reviews, advertisements, listings, recipes, and increasing numbers of supplements, each of which has its own complexity. All of this is important, for there is a huge social history recorded in even the smallest of articles or advertisements. For instance, a search in the British Library Newspaper Pilot (see below) for the word "cigarette" will reveal a very different attitude to smoking than that which prevails today. The Weekly Dispatch of 1 July 1917 appeals to the British public to help keep the hospitals at the front well supplied with tobacco, for "No wounded man ought to ask for a smoke in vain. It is our privilege to keep him supplied."
Newspapers and Libraries

The huge value of newspapers as part of the historic record has always been recognised by libraries, of course, and there are millions of miles of newsprint stored in libraries all over the world. But newspapers present huge problems of preservation and access: they are large in format, prolific in output, and there has been grave concern for decades about the survival potential of historic newspapers, given that many of them were printed on acidic paper. Major libraries such as the Library of Congress in the USA and the British Library in the UK have been microfilming newspapers for many decades in order to preserve the historical record as well as, or instead of, preserving the objects. But there is also concern about the preservation status of microfilm not produced and stored according to standards. Libraries have come under fire for microfilming some titles and then disposing of the originals, but, given the continually increasing problems of storage and funding, what are librarians to do? The fate of newspapers has leapt into prominence over the last two years with the controversies caused by Nicholson Baker and others about selection and retention policies in the UK and the US. (See References.)

Never has there been a better time to think about some new ways of preserving and delivering newspaper content to traditional and new audiences. It takes dedicated researchers to handle broadsheet-sized bound volumes of crumbling paper, or miles of microfilm, especially when most newspapers are minimally indexed. What makes newspapers such a unique resource is what also makes them so difficult to manage. Extracting content from the text of newspapers without presenting all the information around it, as well as the layout and typographical arrangement, is an impoverishing exercise, and clippings without context are bound to lose some meaning.
In historical perspective, too, those aspects of newspapers that are often ignored day-to-day, such as advertising, become a huge source of social, economic, political, and cultural information. But researching newspapers requires diligence and often serendipity, and many scholars and others have spent years in libraries searching through unindexed bound volumes and microfilms. Given the importance of newspapers to our daily lives, finding some way of unlocking the content could create an interest in their historic value for many new audiences, including students, school-children, and anyone interested in the multifarious facts, opinions, products, and stories contained therein.

Digitization of Newspapers

The capture of newspaper content as image files is now possible with modern technologies. In particular, digitization from microfilm has been shown to be fast and cheap, giving relatively good results. It is also possible to create acceptable content from compromised originals, with the resultant files being digitally "cleaned" for better readability. Creating searchable content is a much more difficult process, given the complexity of the newspaper page and the mixed media formats, with text, images, and advertisements juxtaposed and interspersed in order that maximum content can be accommodated in the minimum space. Stories run across widely separated pages, too. The complex structure of newspapers also changes over time and between titles. Early attempts at Optical Character Recognition (OCR) failed because the quality achieved was too poor for adequate retrieval (and correction too costly) and because the OCR engines operated on linear text, not individual content objects. The structural unit of the page was recognised, not the logical unit of the item.
Other problems for OCR (especially with microfilmed content) include curved or rotated lines due to tight bindings, and "noise" or garbage elements, which can be caused by microfilm deterioration, dirt on the scanner, or imperfections in the original, including broken lines, scratches from overuse of the microfilm, and broken characters. Manual indexing and rekeying offer much better possibilities than conventional OCR, but are too costly for libraries, and are probably too costly even for most of the newspaper publishers themselves, though some are producing digital newspaper archives using manual methods (3).

The Olive Software Approach

The problems outlined above are severe, and no technology can compensate for them fully, particularly since scanned microfilm pages often suffer from many of these problems at the same time. Olive Software's PIPEX digitization technology and ActivePaper Archive offer a new approach to the digitization of newspapers, using specially-developed algorithms capable of dealing with most combinations of the above-mentioned OCR problems, and also using new technologies for the "zoning" of content into logical as well as structural components. The Olive process, developed in partnership with OCLC for libraries wishing to create online newspaper archives, recognises that for the best end results, every step in the capture and delivery process has to be carefully controlled, and therefore offers an "end-to-end" solution which utilizes library standards of digitization and metadata creation. The Olive Software solution has been developed over the last seven years, starting life as the smart image software, "Newsware," developed by IOTA, Inc., in the mid 1990s for the Palestine Post project.
Early versions of the software were based upon the recognition of words within the structural page units, and the "smart image" component of the software allowed the hits to be highlighted on the page by mapping the co-ordinates of each word and storing this information alongside the OCR information. However, as Dr Ronald Zweig, Director of the Palestine Post project, has pointed out, this approach has its problems (4).
Zoning, OCR, and the Creation of an XML Repository

ActivePaper Archive has now been used to build a number of newspaper archives. The examples used here are taken from the British Library Newspaper Pilot carried out by the British Library, Olive Software, and OCLC in 2001. The goal of the project was to provide online access to the British Library's historic content. Such access was to be made possible through a process of digitization, divided into two main parts, precision scanning and image processing, with a digital newspaper archive being generated from the processed images. OCLC Preservation Resources executed the precision scanning of microfilmed pages of British Library Newspapers (18 reels of duplicate negative microfilm) to TIFF format, at its Conversion Plant in Bethlehem, Pennsylvania. These files were then shipped to Olive Software's microfilm digitization production facility in Israel for processing using their PIPEX system, in order to produce the digital archive from the TIFF files. A large team was involved at all stages of the project, including the British Library, OCLC, and Olive Software staff. In addition, the Malibu hybrid library project staff at King's College London and Oxford University were involved in the initial inception of the project, and in its design, implementation, evaluation, and promotion (5).

The Digitization Concept

The project's digitization concept, which is at the core of PIPEX and ActivePaper Archive software, was developed with two primary aims: to make digitization practical (significantly reducing time and cost as a result) and to enable high quality access to historic materials. One PIPEX machine uses a 96-CPU computer architecture to perform advanced parallel image processing on each scanned page. One PIPEX-based production line has a monthly capacity of 1.2 million pages, delivering approximately 12 million segmented and tagged articles and photos per month.
Olive/OCLC currently runs two PIPEX production lines, with a third scheduled to come on line in late 2002. With three lines, Olive and OCLC year-end capacity will be about 3.6 million pages per month, delivering 36 million individual tagged items.

Separating "Readability" from "Searchability"

"Readability," defined as the user's capacity to view and comprehend historic text, and "searchability," defined as the user's capacity to reach relevant content through provision of search criteria, can be said to be the two components of "accessibility," or the user's capacity to retrieve and read relevant content. Both readability and searchability are key goals of any digitization effort. In the past, it was thought that text generated by OCR could provide both readability and searchability. Due to the difficulty of extracting high-quality text from historic scans, this approach is now considered to be impractical. ActivePaper Archive is among the first technologies based on, and enabling, separation of readability from searchability.

Readability in ActivePaper Archive

ActivePaper Archive achieves readability by enabling the user to read directly from images instead of from the OCR-generated text. The task of comprehending the degraded text is performed by the human eye and brain, the best possible OCR engine. This is an effective solution to the problem, albeit one that is not simple to achieve. In newspaper material, in particular, it is not practical to provide the online user with readable page images. These would have to be large, high resolution image files, prohibitive both in terms of screen real-estate and download time. This suggests the use of smaller images, of articles and the elements comprising them, to deliver content to the user.
To provide this capacity cost-effectively, ActivePaper Archive uses an image processing technique called "segmentation," which breaks the page down into its smaller information units (articles, pictures and ads, and their components), identifies them, and infers the relationships between them. Using artificial intelligence and a patented bitmap indexing and image search technology, the software attempts to overcome the formidable obstacles of poor image quality and complex page layout, both very common features in historic microfilmed newspapers.

Searchability in ActivePaper Archive

For searchability, ActivePaper Archive relies on OCR-generated word patterns, stored in XML format. The software uses APFS (Adaptive Probability Fuzzy Search, patent pending), a fuzzy logic search technology, to compensate for text inaccuracies by applying fuzzy logic according to the probability of error in each word pattern. Blindly applying fuzzy logic to an entire archive of corrupted text results in large numbers of irrelevant results. This is not recommended in a microfilm setting. APFS applies fuzzy logic only when needed, providing highly relevant results users would not otherwise get. To support the APFS engine, ActivePaper Archive employs special OCR techniques developed to solve the specific problems of microfilmed historic materials. These enable reasonable OCR accuracy, even in very degraded pages, to further enhance the searchability factor. In addition, the technology produces "word patterns" instead of simple ASCII text conversion. These word patterns include the actual characters making up a word, graphic characteristics of the word, and an encoded error probability parameter.

Bridging Layout and Structure, Images and Text

The link between the searchability factor and accurate readability is provided by a patented technique called Bitmap Indexing.
This technique allows for indexing of each meaningful group of pixels (containing a page element like an article title, a body text word pattern, or a picture) on the page image. Having a digital index that points to these valuable image elements enables direct access to, and sophisticated manipulation of, image "clips" instead of cumbersome page images. Bitmap Indexing results in meaningful end-user features. For example, search hits can be highlighted in an article image, and search results pages can display either scaled images of article titles, rather than corrupted OCR text, or the first paragraph of the body text, which offers readable results for true searchability.

In ActivePaper Archive, newspapers and documents run through the PIPEX image processing stage are converted to ActivePaper XML. Traditionally, XML holds text and its structure, but ActivePaper goes further by tying the XML to images. The product uses three XML layers: one based on the NewsML/NITF standards, one on the Dublin Core, and a third on PRML, or Preservation Markup Language. PRML maps the newspaper's layout, recording coordinates for each piece of text and each page object (6). The first two layers, containing industry-standard tags, make certain that the archive is based on an open, integrative platform, while unique PRML tags lay the basis for Bitmap Indexing and APFS. Work is currently being done to make the DTDs interoperable with library standard XML DTDs such as METS, TEI, and EAD.

The archive functions as a dynamic XML repository. The results of image processing (XML files and images) are organized in a logical file-system hierarchy. This provides great flexibility, as the archive can very simply be distributed over multiple hard-drives or storage media. It also avoids the use of database systems, which do not fare well when faced with the volume and complexity of digital newspaper archives.
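As a rough illustration of how word patterns, adaptive fuzzy search, and Bitmap Indexing fit together, the following Python sketch stores each recognised word with its page coordinates and an estimated error probability. It matches exactly where the OCR is trusted and falls back to fuzzy matching only for doubtful words, returning the image regions to highlight. This is not Olive's implementation: the `WordPattern` class, the thresholds, and the use of `difflib` similarity are all assumptions made for the example.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class WordPattern:
    """Hypothetical stand-in for an OCR word pattern (not the real format)."""
    text: str          # characters recognised by OCR
    box: tuple         # (x, y, width, height) of the word on the page image
    error_prob: float  # estimated probability that the OCR text is wrong


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def search(patterns, query, doubt_threshold=0.3, min_similarity=0.75):
    """Return the boxes of matching words, ready for highlighting.

    Fuzzy matching is attempted only for words whose error probability
    reaches doubt_threshold (an assumed value); trusted words must match
    the query exactly.
    """
    hits = []
    for p in patterns:
        if p.text.lower() == query.lower():
            hits.append(p.box)
        elif (p.error_prob >= doubt_threshold
              and similarity(p.text, query) >= min_similarity):
            hits.append(p.box)
    return hits


# Invented sample data: one clean word, one damaged word, one clean near-miss.
page = [
    WordPattern("cigarette", (120, 340, 96, 18), 0.05),
    WordPattern("cigarctte", (410, 912, 94, 18), 0.60),
    WordPattern("cigarettes", (88, 1200, 104, 18), 0.02),
]

adaptive = search(page, "cigarette")
blind = search(page, "cigarette", doubt_threshold=0.0)  # fuzzy on everything
print(len(adaptive), "adaptive hits;", len(blind), "blind fuzzy hits")
```

With this sample data the adaptive search recovers the damaged word without also returning the cleanly recognised near-miss, whereas applying fuzzy matching to every word returns all three.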
Most importantly, the XML repository can be accessed directly by a Web browser, using XML style sheet technology.

Potential for True Online Accessibility

As evidenced by the results of the project, this conceptual and technological shift from previous visions of digitization means that, for the first time, the technology provides the potential for true online accessibility to large quantities of historic materials with complex content like newspapers.

Building the British Library Demonstrator
2. TIFF image pre-processing and binding
Next, TIFFs were named according to their page number, issue date, and publication name, and the images were optimized. Since generic algorithms may damage microfilmed images, much research has gone into Olive's automatic image cleanup and alignment procedure: a microfilm frame may contain one or two openings, or may have overlap of a fragment of a neighbouring opening (as in the image above). The Olive system (patent pending) automatically separates out the individual page, and deskewing and cleanup of each page is then performed. Different cleanup methods are used for text, images, and margins.

3. Page zoning
Here, the page image has been analysed to find horizontal and vertical lines, text strings, and picture regions. Then, working like a human eye that views a newspaper page from a distance, the zoning engine uses these lines and shapes to analyse the geometry of the page. It builds a net of image objects, examining alignment, size, brightness, and other characteristics of groups of elements on the grid. The result is a rough page structure definition, which includes text regions, classified as body text or titles.

4. OCR

OCR was performed on each of the text regions detected in image analysis. The results of OCR were written into a PDF, overlaid on page images, together with detailed information about word coordinates, font, and size.

5. Segmentation
In this stage, all the information gathered in image analysis, layout analysis, and OCR is put to use. The segmentation engine analyses textual objects and their optically-recognized text to find page objects like articles, pictures, and ads, their components, and the relationships between them. This structural information is also written into the PDF.

6. Output to ActivePaper XML

[Figure: Component Format Example]
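As a purely illustrative stand-in for a component record, the Python sketch below assembles one article component carrying the three kinds of information the article attributes to ActivePaper XML: structural content, a Dublin Core description, and layout coordinates. Only the Dublin Core namespace is a real, published identifier. Every other element and attribute name (`component`, `content`, `headline`, `layout`, `word`, `box`) is invented for the example and should not be read as the ActivePaper or PRML schema. The headline and coordinates are placeholders.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"  # real Dublin Core element namespace
ET.register_namespace("dc", DC)


def build_component(kind, headline, source, date, words):
    """Build one hypothetical component: content, description, and layout."""
    comp = ET.Element("component", {"type": kind})

    # Layer 1 (assumed names): structural content of the item
    content = ET.SubElement(comp, "content")
    ET.SubElement(content, "headline").text = headline

    # Layer 2: Dublin Core description of the item
    ET.SubElement(comp, f"{{{DC}}}source").text = source
    ET.SubElement(comp, f"{{{DC}}}date").text = date

    # Layer 3 (assumed names): where each word sits on the page image
    layout = ET.SubElement(comp, "layout")
    for text, (x, y, w, h) in words:
        ET.SubElement(layout, "word", {"box": f"{x} {y} {w} {h}"}).text = text
    return comp


component = build_component(
    kind="article",
    headline="Example headline",          # placeholder text
    source="Weekly Dispatch",
    date="1917-07-01",
    words=[("Example", (120, 80, 140, 32)), ("headline", (270, 80, 150, 32))],
)
ET.indent(component)  # pretty-printing helper, available from Python 3.9
print(ET.tostring(component, encoding="unicode"))
```

Keeping the coordinates next to the text is what allows a search hit to be traced back to a region of the page image rather than to a string of possibly corrupted characters.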
7. Building the Demonstrator

Once the images had been scanned and processed to create the XML repository, an experimental Web site was built. This Web site links to the opening "portal" of the repository, which physically resides on an ActivePaper Archive server installed at King's College London.
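The repository behind the demonstrator is described as a file-system hierarchy of XML files and images rather than a database. A minimal sketch of how such a hierarchy might be addressed is given below. The title/year/month/day layout, the `page-NNN` file names, and the `.tif` suffix are assumptions chosen for illustration, not the layout actually used by ActivePaper Archive.

```python
from datetime import date
from pathlib import PurePosixPath


def issue_dir(root, title, issue_date):
    """Directory holding one issue: root/title-slug/YYYY/MM/DD (assumed layout)."""
    slug = title.lower().replace(" ", "-")
    return (PurePosixPath(root) / slug
            / f"{issue_date:%Y}" / f"{issue_date:%m}" / f"{issue_date:%d}")


def page_files(root, title, issue_date, page):
    """Paths of the XML description and the image for one page (assumed names)."""
    base = issue_dir(root, title, issue_date) / f"page-{page:03d}"
    return base.with_suffix(".xml"), base.with_suffix(".tif")


xml_path, image_path = page_files("/archive", "Weekly Dispatch", date(1917, 7, 1), 3)
print(xml_path)    # /archive/weekly-dispatch/1917/07/01/page-003.xml
print(image_path)  # /archive/weekly-dispatch/1917/07/01/page-003.tif
```

Because every item is located by path alone, moving a title or a year to another disk only changes the root, which is one way to read the flexibility claimed for the file-system approach.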
A powerful and flexible search engine embedded in the Olive system allows users to perform Boolean searches on the entire repository of more than 200,000 items. Searches can also be restricted by date or newspaper title, and can be further refined by exploiting the XML structure of the repository by searching only within articles, advertisements, or pictures. Further precision can be obtained by searching for individual elements ("title," "byline," etc.) within items. Search results can thus display "snippets" of the newspaper page: article titles, the first few lines of text, image captions, or advertisements, so that the results are meaningful at a glance. Clicking on the snippet opens a window displaying the whole item, and from there the user can navigate to the item's position on the newspaper page. It is also possible to navigate the archive by newspaper title and date, just as in a traditional archive.

Conclusion

With Olive Software's technology, the dream of low-cost, fully automated digitization and delivery of historic newspaper content has been achieved, offering libraries new possibilities for increasing access to a greater range and number of potential users. The technology can also be used for the development of searchable archives of other kinds of documents, as for instance has been shown by the development of the Forced Migration Online Digital Library, which contains some 3,000 items (c. 70,000 pages) of grey (unpublished) literature on all aspects of refugee studies.

Acknowledgements

(1) Richard Entlich, FAQ: Where are they now? Digitizing Microfilmed Newspapers, RLG DigiNews, June 15, 2002, Volume 6, Number 3.
(2) Alan Howell, Film Scanning of Newspaper Collections: International Initiatives, RLG DigiNews, August 15, 1997, Volume 1, Number 2.
(3) See, for instance, http://www.bellhowell.infolearning.com/proquest/histdemo/
(4) Ronald W. Zweig, Retrieving Text from Digital Images: Lessons from the Palestine Post Project, http://kipp.tau.ac.il/lessons.htm; Solving the Problem of Access – Only to Drown in the Details: Problems in Newspaper Retrieval Systems, http://kipp.tau.ac.il/update.htm
(5) There are further details about the British Library Newspaper Pilot at www.uk.olivesoftware.com/conference.
(6) PRML was developed by Olive Software. OCLC is working with Olive to standardize PRML. Olive will provide a copy of the draft specification upon request. Contact Emil Steinvel for further details.

References

Baker, N. (2000) Deadline: the Author's Desperate Bid to Save America's Past, The New Yorker (24 July).
Baker, N. (2001) Double Fold: Libraries and the Assault on Paper, Random House Trade.
Cox, R. J. (2000) The Great Newspaper Caper: Backlash in the Digital Age, First Monday, 5 (12), http://firstmonday.org/issues/issue5_12/cox/index.html.
Pearson, D. (2000) Letter, Times Literary Supplement (8 September).