
ABSTRACT
Until recently, most newspaper digitization projects have been the province of large national or state organizations.1,2 Improvements in Optical Character Recognition (OCR) technology, the appearance of service bureaus specializing in the digitization of historical newspapers, and the availability of open-source digital library software have brought costs down and made it both technically and economically feasible for smaller organizations to undertake such projects.
This article describes how the Tuzzy Consortium Library, a small regional library in a very remote location (Barrow, Alaska), successfully undertook the digitization of the Tundra Times, a statewide newspaper that documents the history of Alaska Native peoples and their political struggles from 1962 to 1997.
The Library adopted and extended the methodology developed by the Utah Digital Newspapers Project.3 It departed from the Utah approach in two significant ways. First, the Library used contractor-supplied workflow-processing software that enabled the segmentation and metadata markup of the newspaper articles to be done locally, in Barrow. Second, it used Greenstone, an open-source digital library software suite, rather than a commercial product to publish the newspaper on the Internet.

Figure 1: Tundra Times website
BACKGROUND
The Tundra Times was a statewide, weekly and bi-weekly English-language newspaper whose primary audience was the Alaska Native population. It published from 1962 to 1997—a period of immense change for the Native peoples of Alaska and indeed the state as a whole. The newspaper reported on events that transformed the Native way of life, including settlement of land claims, founding of Native corporations, and the transfer of health and social services to Native-operated nonprofit organizations. The Tundra Times records these struggles for self-determination and compensation from an Alaska Native perspective and, as such, is an important resource for Alaska Natives, Native peoples worldwide, and researchers.
In 1997, after a long financial struggle, the Tundra Times ceased publication. The Ukpeagvik Iñupiat Corporation of Barrow acquired its archives and copyrights, and a year later, the collection was turned over to the Tuzzy Consortium Library. The Library serves as the academic library for Ilisagvik College and the public library for the North Slope Borough (NSB) of Alaska. Bordered on the north by the Arctic Ocean and on the south by the Brooks Range, the NSB is the northernmost organized municipality in the United States, lying entirely above the Arctic Circle. A major part of the Library’s mission is to disseminate collections such as the Tundra Times newspaper directly from Barrow.
In October 2002 the Library received a $150,000 two-year grant from the Institute of Museum and Library Services’ (IMLS) Native American Library Services program to digitize the microfilm of the 35-year run of the Tundra Times and make its articles accessible. Previously, this important cultural and political resource was rarely used since it had never been indexed and was available only on microfilm. Seventeen years of the paper are now Internet accessible.
DIGITIZATION PROCESS
In order to make a newspaper available for searching on the Internet, the following processes must take place: (1) the microfilm copy or paper original is scanned, (2) master and Web image files are generated, (3) metadata is assigned for each issue, page, and article to improve the searchability of the newspaper, (4) OCR software is run over high resolution images to create searchable fulltext, and (5) OCR text, images, and metadata are imported into a digital library software program.
With approximately 27,000 pages to process, one technician dedicated to the project and a tight two-year grant-funded timeframe, it was critical to determine which processes should be done in-house and which should be outsourced. From the outset it was clear that the scanning of the microfilm should be outsourced because this requires expensive and specialized equipment and considerable expertise to obtain the best results. The quality of the images significantly affects the accuracy of the OCR processing so this is a critical step. The service bureau that scans the microfilm normally also generates the derivative Web image files.
We searched for newspaper digitization models that would give us guidance on the remaining steps in the process: OCR processing, selection of digital library software, and assignment of metadata. We looked for approaches that were affordable and could be implemented by a small library with limited technical resources.
During the grant writing process we were fortunate to receive input and support from the Hawaiian Newspaper4 and Maori Newspaper5 projects. Both projects used ABBYY FineReader, a commercial OCR program, and did almost all processing in-house. Initially they experienced a very slow rate of progress due to the complexity of working with historical newspapers, the time needed to train the OCR dictionary to recognize Maori and Hawaiian alphabets, and the great deal of manual error correction that the OCR output required. Both projects used Greenstone open-source digital library software to publish their newspapers on the Internet. Our IMLS grant application was based on their approach.
Since the inception of the Maori and Hawaiian Newspaper projects there has been considerable improvement in OCR technology, and service bureaus have appeared that specialize in generating OCR text from images of historical newspaper articles. The J. Willard Marriott Library at the University of Utah prototyped a viable and affordable approach to newspaper digitization. They used a service bureau to scan and process newspapers to produce image files and XML-tagged metadata and a commercial digital library software program to publish the newspaper on the Internet.3 The Utah model offered clear advantages and prompted us to change direction from the more labor-intensive method used by the Hawaiian and Maori newspaper projects.
NEWSPAPER DIGITIZATION SERVICE BUREAU
We determined that the best approach for the Tundra Times project would be to outsource not only the scanning of the microfilm and generation of the derivative image files, but also the OCR processing and generation of XML files of the OCR text and metadata as these processes could be done more efficiently and less expensively by a service bureau.
The Library contracted these services to iArchives, Inc. They offer a patented digitization and OCR technology that has been used successfully in other newspaper projects, including the University of Utah’s. We selected iArchives as our vendor because they were cost effective, had a proven track record, and were willing to work with our smaller print run.
Use of iArchives’ Web-based workflow management software (WFM) (described in detail later) was an important part of our digitization strategy as it enabled the archival technician to work in Barrow. The WFM also enabled us to meet two other important project goals: retention of local control of the process and provision of training in advanced library technologies to the archival technician.
DIGITAL LIBRARY SOFTWARE
Selection of the digital library software that would provide access to the fulltext of the newspaper and images of articles was a critical step. Many options are available, including Greenstone; the University of Michigan’s DLXS (free or low cost to non-profit institutions); the commercial products CONTENTdm, ENCompass for Digital Collections, and MS Content Management Server; and custom solutions.
We determined that Greenstone was the best way for us to publish the OCR text and page images on the Internet. Greenstone was developed by the New Zealand Digital Library Project and is issued under the terms of the GNU General Public License.6 It is a mature, respected, and widely used piece of software that has been under development for over five years, is stable, and is well supported. Greenstone is designed to be portable to almost any operating system and requires only very modest system resources, meaning that usually it can be installed on existing hardware and doesn’t require a dedicated server. It is highly configurable and the user interface can be customized with relatively little effort. Greenstone has good multilingual support and works well for both searching and displaying non-Latin character sets. Although the Tundra Times is written in English, this facility is important because many personal names, place names, and Native-language terms appear in it.
Greenstone met all our technical requirements, and because it is open-source, we were able to realize considerable savings compared to the cost of commercially available software. We used $2,000 to hire a consultant (DL Consulting Ltd.) to enhance Greenstone’s search and display capabilities and to create the XML plug-in, and anticipate spending an additional $1,000 to resolve issues discovered since loading data.
Greenstone is capable of accepting any number of input file types; however, we confirmed that formatting the OCR output file in XML would provide the most flexibility and portability. Development of the XML plug-in needed to implement this functionality required close cooperation between our digitization vendor and the Greenstone consultant. The XML plug-in stores all the information necessary to enable search term highlighting within the PDF Searchable Image files (formerly PDF Image+Text), a feature users have come to expect.7 Implementing this highlighting required complex customization of the Greenstone runtime system. Unfortunately, this feature works only under Windows with Internet Explorer and Acrobat Reader.
We requested an enhancement allowing searches to be limited by date range and article type, which we felt would benefit users dealing with large amounts of fulltext. We also intend to customize the “look” of the Greenstone interface by using the newspaper masthead instead of the generic Greenstone graphic.

Figure 2: Tundra Times Greenstone Search Interface
METHODOLOGY
Microfilm Scanning
The best microfilm for image projects is an unused, negative, first generation copy of the master. Although the negative was unavailable, we were very fortunate to receive an unused copy of positive microfilm from the Alaska State Library.
iArchives’ microfilm scanning process is largely automated and can include a variety of processes designed to improve image quality such as deskew, dynamic threshold, despeckle, bleed-through reduction, curvature removal, image format conversion, and scaling.
Microfilm rolls are scanned in batches, as the technician completes a work unit. Each roll contains approximately 700 page images, which are scanned at 400 dpi (some earlier images were scanned at a lower resolution). The master image is a cropped, de-skewed, 8-bit grayscale TIFF master scan, averaging 34 MB.
Image processing is always a trade-off. When an algorithm is applied to enhance an aspect of an image or eliminate a defect in a page (for example, bleed-through), the algorithm may at the same time adversely affect some other aspect of the image. Quality control steps are built in at critical points; however, the final quality control occurs when the customer reviews the delivered images.
We did discover variable quality in the microfilm, which naturally affects the quality of the images produced. In one instance the right side of the page is consistently darker than the left, leading one to conclude that a light source for the microfilm scan table was burned out or turned off. Problems such as this occur only very infrequently; therefore, no image correction processes were run on the master scan and Web derivative images. However, a binarization algorithm is run on the 1-bit OCR version of the image, which corrects for a gradient such as this, producing images of a more uniform contrast and brightness. This ensures the image is optimized for the OCR process (the OCR images are discarded after the data has been delivered).
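To illustrate why binarization can correct for a brightness gradient, here is a minimal sketch of a local-mean (adaptive) threshold in plain Python. It is not iArchives' algorithm, which the article does not describe. The function names, window radius, offset, and sample pixel values are all assumptions chosen for illustration.

```python
def global_binarize(gray, threshold=128):
    """Classify every pixel against one fixed threshold (0 = ink, 1 = paper)."""
    return [[0 if value < threshold else 1 for value in row] for row in gray]


def adaptive_binarize(gray, radius=2, offset=10):
    """Classify each pixel against the mean of its local neighborhood.

    gray is a list of rows of 0-255 grayscale values. A pixel counts as ink
    only if it is darker than its surroundings by more than `offset`, so a
    slow change in background brightness does not flip the result.
    """
    height, width = len(gray), len(gray[0])
    result = []
    for y in range(height):
        out_row = []
        for x in range(width):
            y0, y1 = max(0, y - radius), min(height, y + radius + 1)
            x0, x1 = max(0, x - radius), min(width, x + radius + 1)
            window = [gray[j][i] for j in range(y0, y1) for i in range(x0, x1)]
            local_mean = sum(window) / len(window)
            out_row.append(0 if gray[y][x] < local_mean - offset else 1)
        result.append(out_row)
    return result


# Invented one-row "page": the background darkens steadily from left to
# right, with ink at positions 2 and 9.
sample_row = [[220, 210, 120, 190, 180, 170, 160, 150, 140, 50, 120, 110]]

global_result = global_binarize(sample_row)
adaptive_result = adaptive_binarize(sample_row)
```

In this invented sample the fixed threshold misreads the darker right-hand background as ink, while the local threshold marks only the two ink pixels.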
Image Formats & Resolutions
A number of decisions needed to be made about the master and Web images including file format (TIFF, JPEG, JPEG2000, or PDF Searchable Image), bit depth (grayscale or bitonal), and resolution. Lack of available tools at the time precluded JPEG2000 from consideration. iArchives provided test images in a variety of formats and resolutions to help us decide among the other options.
iArchives delivered three sets of images derived from the 8-bit grayscale TIFF master scan: (1) 4-bit grayscale, 400 dpi TIFF images (average file size 17 MB), (2) 125 dpi, 8-bit TIFF images converted to JPEG page and article files for Web delivery (average file size 0.53 MB), and (3) JPEG thumbnails of full pages.
We debated the issue of file size versus image quality at length for the Web versions of the page and article images. When we compared black and white test images to grayscale images it was clear that the grayscale images significantly improved the fidelity of the text and photographs. However, grayscale article and page images are significantly larger than 1-bit black-and-white images (0.53 MB compared to 0.25 MB). We decided that it was more important to have higher quality images than smaller file sizes and chose grayscale images for page and article images for Web delivery. Given the trend to ever-higher communication speeds and increasing availability of broadband, we decided that larger file sizes would not long be a problem.
Our Web delivery format is Adobe’s PDF Searchable Image file. Each PDF file comprises a JPEG image, OCR text, and select metadata and averages 0.56 MB per page. Adding the text to the PDF adds 3-8% to the file size (“noisy” images generate larger text files and are at the 8% end of the range). Each word is inserted into the file at offsets specified by the OCR text bounding box coordinates in the XML file. Newspaper articles can extend across several pages and all parts of an article are contained in a single PDF file. PDF Searchable Image files are viewable and searchable by Adobe’s Acrobat Reader, a common plug-in that is well supported by Adobe. Packaging larger JPEG files within the PDF ensures the text can be read comfortably onscreen, even by users with visual disabilities (using the Reader’s magnify/zoom option).
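The placement of words at offsets taken from bounding box coordinates can be sketched as follows. The XML layout shown is hypothetical, since the article does not describe the actual schema. The sketch also assumes pixel coordinates with a top-left origin, which must be scaled to PDF points (72 per inch) and flipped, because PDF user space places its origin at the bottom-left by default.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout; the real schema is not given in the article.
SAMPLE_XML = """
<page width="4000" height="6000" dpi="400">
  <word x="400" y="800" w="600" h="100">Tundra</word>
  <word x="1040" y="800" w="520" h="100">Times</word>
</page>
"""

POINTS_PER_INCH = 72.0  # PDF user-space units per inch


def word_offsets(xml_text):
    """Return (text, x_pt, y_pt) for each word, in PDF points.

    Assumes bounding boxes are given in scan pixels with a top-left origin.
    The bottom edge of each box is used as an approximate text baseline.
    """
    page = ET.fromstring(xml_text)
    dpi = float(page.get("dpi"))
    page_height_px = float(page.get("height"))
    scale = POINTS_PER_INCH / dpi
    placed = []
    for word in page.iter("word"):
        x_px = float(word.get("x"))
        y_px = float(word.get("y"))
        h_px = float(word.get("h"))
        x_pt = x_px * scale
        y_pt = (page_height_px - (y_px + h_px)) * scale  # flip the y axis
        placed.append((word.text, x_pt, y_pt))
    return placed


offsets = word_offsets(SAMPLE_XML)
```

A PDF writer would then draw each word invisibly at its computed position over the page image, which is what allows a reader application to search and highlight text on a scanned page.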
Workflow Management System (WFM)
The WFM is the heart of iArchives’ service – it tracks the images, project specifications, metadata, OCR text, and quality aspects associated with multiple projects. Once a customer’s project specifications are determined, the WFM distributes work to computers hosting automated and manual processes. It is this distributed processing model that enables the archival technician to work on the project in Barrow even though the WFM itself resides on a computer in iArchives’ Utah data center. She requests work units (newspaper issues) from the WFM and marks up page level images with article information and metadata. This data is saved locally until it is completed, then it is transmitted back to iArchives via FTP.

Figure 3: Tundra Times Workflow
The distributed workflow is scalable and could easily be adapted to enable volunteers to mark up the paper and assign metadata, the most time-consuming and therefore most costly components of the digitization process described here. Given the financial constraints many small cultural institutions operate under, this approach could be the key to undertaking a digitization project like the Tundra Times.
Metadata
The archival technician assigns both page level and article level metadata. Each page image is tagged with the following metadata: (1) publication title, (2) publication date, (3) volume and issue number, and (4) page number. The page image is then segmented into articles and the following article level metadata is keyed: (1) headline, (2) byline, (3) classification, and (4) whether the article is a lead story.
Even the best commercially available OCR engines do not achieve 100 percent accuracy and for most projects it is not economically feasible to manually error correct OCR output. Metadata are important because, when combined with the fulltext, they can improve search accuracy and enable searches to be constrained to particular sections of the paper. For example, classifications enable end users to search the editorials, news articles, cartoons, or advertisements, and/or to search for a particular byline.
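The page level and article level fields listed above, and the way they allow a search to be constrained, can be sketched with simple record types. The class names, field names, date format, and sample values are illustrative assumptions; they are not taken from the project's XML files or from Greenstone.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Article:
    headline: str
    byline: str
    classification: str       # e.g. "Editorial", "News", "Advertisement"
    lead_story: bool = False


@dataclass
class Page:
    publication_title: str
    publication_date: str     # ISO 8601 string, an assumed convention
    volume: int
    issue: int
    page_number: int
    articles: List[Article] = field(default_factory=list)


def find_articles(pages, classification=None, byline=None):
    """Return (date, page number, headline) for articles matching the filters."""
    matches = []
    for page in pages:
        for article in page.articles:
            if classification and article.classification != classification:
                continue
            if byline and article.byline != byline:
                continue
            matches.append(
                (page.publication_date, page.page_number, article.headline)
            )
    return matches


# Invented sample values, for illustration only.
sample_pages = [
    Page("Tundra Times", "1970-01-01", 1, 1, 1, [
        Article("Sample editorial headline", "Staff", "Editorial", True),
        Article("Sample advertisement", "", "Advertisement"),
    ])
]

editorials = find_articles(sample_pages, classification="Editorial")
```

In a production system these filters would be combined with a fulltext query; here they stand alone to show how keyed metadata narrows a result set.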
OCR Processing
OCR creates searchable text from digital images. Deegan2 and Arlitsch8 describe the difficult issues associated with OCR processing of newspaper images. Many factors affect the accuracy of OCR. These range from the quality of the source image to the complexity of layouts (many commercial OCR products do not deal well with the complex formats found in newspapers). Smaller fonts, often used in older newspapers, require higher image resolution for optimal OCR performance. Arlitsch concludes that digitizing historic newspapers is a much more complex task than generating text from other documents and states, “nothing but the most robust OCR software should be used for historic newspapers.”9
OCR is run on article images, or more precisely, on a number of rectangular regions that comprise an article. The resulting OCR text is assembled for each article as well as for the entire page. iArchives’ OCR framework employs several of the best commercial OCR engines. The OCR framework assumes that errors made by different OCR engines are weakly correlated; thus, in cases where the OCR engines do not agree on the word found at a particular location (node), the result of each engine is preserved. This technique improves search recall over that of a single OCR engine, especially for low quality images.
The OCR text produced for use by Greenstone or other content delivery software is assembled from these results and may consist of the single most likely result for a location where the OCR engines did not agree, the most likely results, or all results. These results may be further filtered by a stop word list, one or more dictionaries (including non-English languages), or a custom dictionary. For the Tundra Times all results are preserved and no stop word list is used, but a filter is used to reduce “noise” words. iArchives guarantees that the OCR accuracy will always be as good as that of the best commercial engines. They conducted a measured OCR accuracy test for us on two sample images and achieved a 98.4% accuracy rate.
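The idea of preserving each engine's result where the engines disagree can be sketched as below. This is only an illustration of the general approach, not iArchives' patented framework. It assumes the engines' outputs have already been aligned so that position i in every list refers to the same location (node), and the function names and sample words are invented.

```python
from collections import Counter


def merge_engine_results(engine_outputs, keep="all"):
    """Combine aligned word lists from several OCR engines.

    engine_outputs holds one list of words per engine, aligned by node.
    keep="all" preserves every distinct candidate at each node;
    keep="best" keeps only the candidate most engines agree on.
    """
    merged = []
    for candidates in zip(*engine_outputs):
        if keep == "all":
            # dict.fromkeys removes duplicates while keeping first-seen order
            merged.append(list(dict.fromkeys(candidates)))
        elif keep == "best":
            merged.append([Counter(candidates).most_common(1)[0][0]])
        else:
            raise ValueError("keep must be 'all' or 'best'")
    return merged


def index_terms(merged):
    """Flatten merged candidates into a set of lowercase search terms."""
    return {word.lower() for node in merged for word in node}


# Invented outputs from three engines for the same two nodes.
engines = [
    ["Tundra", "Tirnes"],
    ["Tundra", "Times"],
    ["Tundra", "Times"],
]

all_results = merge_engine_results(engines, keep="all")
best_results = merge_engine_results(engines, keep="best")
```

Keeping all candidates means a search for "times" still finds the node even if one engine misread it, at the cost of indexing some noise words, which is why a noise filter is a natural companion.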
Digital Object Production
The final step in production is the assembly of digital objects comprising the images, OCR text, and metadata gathered by the WFM. Two deliverables are sent to Barrow on DVD: (1) objects for each issue of the newspaper consisting of PDF Searchable Image files containing 125 dpi JPEG images and OCR text, JPEG thumbnails, and XML files describing the location of each word discovered on the page and the metadata gathered during processing and (2) archival 4-bit grayscale TIFF images.
Quality Control
Manual and automated quality control steps are built into the scanning process and the WFM. For example, a missing page report is generated if the numerical sequence indicates a page is missing. iArchives also developed a tool to enable the archival technician to request a replacement image if she encounters a bad image during markup.
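A missing page check based on the numerical sequence can be sketched in a few lines. The function name is hypothetical and this is not the WFM's actual report logic. Note that a check of this kind cannot detect pages missing from the very start or end of an issue.

```python
def missing_pages(page_numbers):
    """Return page numbers absent from the range spanned by the input."""
    present = set(page_numbers)
    if not present:
        return []
    return [n for n in range(min(present), max(present) + 1) if n not in present]


gaps = missing_pages([1, 2, 4, 5, 8])
```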
Metadata accuracy is somewhat compromised as only one person performs the initial data entry, rekeys it, and then reconciles the differences. Ideally a second operator would rekey the data while a third reconciles any differences. Service bureaus that do re-key work use the following rule of thumb: single key yields ~95% accuracy, double key with second key doing reconciliation yields 99.5% accuracy, and double key with blind reconcile yields 99.95% accuracy. Using this as a guide, we can expect accuracy around 97-99%. To some extent the redundancies inherent in fulltext will mitigate any incorrectly entered metadata; however, end users will experience some reduction in searchability. Such compromises are the unfortunate reality of working in a remote location with limited resources.
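The reconciliation step in double keying amounts to comparing two independently keyed versions of a record and flagging the fields that differ for review. A minimal sketch, with invented field names and values:

```python
def reconcile(first_key, second_key):
    """Return {field: (first value, second value)} for fields that differ."""
    fields = sorted(set(first_key) | set(second_key))
    return {
        name: (first_key.get(name), second_key.get(name))
        for name in fields
        if first_key.get(name) != second_key.get(name)
    }


# Invented example: the two keyings disagree on the byline only.
first = {"headline": "Sample headline", "byline": "J. Smith", "classification": "News"}
second = {"headline": "Sample headline", "byline": "J. Smyth", "classification": "News"}

differences = reconcile(first, second)
```

When the same person produces both keyings, as in this project, an error repeated in both passes will not be flagged, which is one reason a separate second operator yields higher accuracy.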
The Library contracted with Information Access, a library consulting firm, to assist with project management, provide mentoring and training to the archival technician, and ensure the project met digital collections standards. Additionally, iArchives staff traveled to Barrow to train the archival technician in the use of their software and procedures and provided ongoing telephone support and backup as part of a support contract.
EVALUATION
The project recently completed its second year. The Library was able to meet the majority of its project objectives by blending the best solutions from other digitization projects to develop a viable and cost effective method for digitizing a historical newspaper and making it available over the Internet.
However, disappointingly, we were able to publish only 17 of the 35 years of the newspaper within the two-year grant period. We have identified three main reasons for this: insufficient staff, a delayed project start date, and an increase in the number of pages to be processed.
The original proposal called for the technician to receive part time assistance from students from Ilisagvik College and, during summer vacations, two library science students were to come to Barrow to work on the project. However, for various reasons, neither option came to pass. This significantly impacted the production output. The technician averages only 4 hours a day on the project, both because the work is very detailed (it is not possible to work for longer than this without losing accuracy) and because she has other responsibilities.
Secondly, the project start date was delayed for almost a year, due to adoption of the Utah model as opposed to the more manual process outlined in the grant application and the time needed to determine project specifications and negotiate contracts with project vendors. Actual processing began in September 2003. Finally, once we received the rolls of microfilm we revised the estimate of the number of pages to be processed from 20,000 to 27,000.
IMLS has granted an extension until September 2005. We project their funding will be exhausted by April 2005; however, another funding source has been identified and is sufficient to complete the project. Given the current rate of progress an additional 13 months is required to complete the project.
What has worked well, given constraints of budget, staffing, and location, is asking all external vendors to provide training and support. This provided critical backup and continuity and enabled the project to be undertaken in Barrow using local staff.
As with any project we have encountered obstacles and setbacks. Solving IT infrastructure problems has been time consuming because the Library does not have a dedicated IT person. Dealing with multiple vendors and non-traditional software has stretched college IT staff and required good will on both sides to resolve issues in a timely way.
Sometimes the ramifications of a decision are not clear until the data are loaded and searching begins. For example, the header section of each page (containing the page number and date) had been given a default classification of “unclassified”. When the first issues were loaded it became apparent that these unclassified non-content bearing images were cluttering the result set display. iArchives added the term “Header” to the classification list and the Greenstone plug-in was edited so that all the headers could be excluded from the result set display.
Now that sufficient data have been loaded we have been able to test Greenstone more thoroughly. We have discovered several small glitches, such as broken thumbnail links and mistyped metadata, as well as things that we would like to improve, such as listing each year separately in the browse section. A more serious problem emerged when we realized that the search term highlighting was not working properly. When search terms are too long (in fact, whenever they are longer than very short single terms such as “marine”), the URLs become too long for Internet Explorer to handle, causing search term highlighting to fail. A simple edit to change the form method from “get” to “post” solved the problem.
iArchives has added functionality to their software over the project period causing changes to the way the paper is processed. For example, in order to deal with hyphenation that frequently occurs in newspapers due to narrow column widths, the OCR software can now join word fragments so they are searchable. Incorporating this functionality part way through the project meant that the text is not generated in a consistent way, negatively impacting searching and retrieval.
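Joining word fragments split by end-of-line hyphens can be sketched as follows. This is a simplified illustration, not the vendor's implementation, and the function name and sample text are invented. It joins every fragment that ends a line with a hyphen, so a genuinely hyphenated compound that happens to break at a line end would lose its hyphen; a dictionary check could reduce such errors.

```python
def join_hyphenated(lines):
    """Return the words of a column, rejoining end-of-line hyphen splits."""
    words = []
    carry = ""  # fragment held over from the end of the previous line
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        if carry:
            tokens[0] = carry + tokens[0]
            carry = ""
        if tokens[-1].endswith("-") and len(tokens[-1]) > 1:
            carry = tokens.pop()[:-1]
        words.extend(tokens)
    if carry:
        # The column ended mid-word; keep the fragment as it appeared.
        words.append(carry + "-")
    return words


joined = join_hyphenated(["The corpo-", "ration met in Bar-", "row."])
```

Text generated before this kind of joining was introduced would index "corpo-" and "ration" separately, so a search for "corporation" would miss it. That is the inconsistency described above.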
COSTS
Newspaper scanning and processing costs are detailed in Table 1. Project costs vary based on the specific processing decisions made. We partnered with iArchives to pilot two processes: external use of their Web-based workflow management software and development of the XML plug-in for Greenstone digital library software and, therefore, received a discounted rate on processing costs. iArchives’ processing fee includes image processing, OCR, use of the WFM, and image delivery on DVD. Forty hours of support were provided for an additional $2,500.
Article zoning and metadata markup are the most labor-intensive and, therefore, most costly parts of the process. The Library’s technician performed these tasks. Her $21 hourly rate is the minimum wage in Barrow, a remote community where living costs are very high. An alternative to doing the metadata markup and page segmentation in house would have been to outsource these processes too. iArchives provided an initial cost estimate of $1.50 per page to enter article level metadata. They are able to offer such a relatively low per-page cost because they in turn outsource this work to overseas contractors who use the WFM to process the page images.
Given the high cost of doing business in Barrow and the fact that the sole technician is only able to work 4 hours a day on the project, using local staff for these tasks was not the most cost effective solution. However, we felt it was important for indigenous cultural and intellectual rights to be respected, and using local staff met other important project goals such as retention of local control, ensuring the economic benefit and skills remained in the community, and providing training and educational opportunities to project staff.
Our costs are based on the total number of pages processed by iArchives by December 2004 (12,600) and on the time the technician actually spent on the project: 12 months rather than the 15 months elapsed, owing to vacation and delays in receiving work units, and 4 hours a day rather than 8. The technician processed approximately 57 pages a day, averaging 1,100 pages per month. This gives a per page cost of approximately $1.70/page. In the future we will use the built-in time tracker in the WFM to generate more precise cost figures.
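The article does not show the arithmetic behind the per page figure. The sketch below reproduces it under one added assumption that is not stated in the source: 21 working days per month. It also estimates the time remaining from the page counts and monthly rate quoted in the text.

```python
HOURLY_RATE = 21.00            # dollars, from the text
HOURS_PER_DAY = 4              # from the text
MONTHS_WORKED = 12             # from the text
PAGES_PROCESSED = 12_600       # by December 2004, from the text
TOTAL_PAGES = 27_000           # revised estimate, from the text
PAGES_PER_MONTH = 1_100        # average quoted in the text
WORKING_DAYS_PER_MONTH = 21    # ASSUMPTION: not stated in the article

labor_cost = HOURLY_RATE * HOURS_PER_DAY * WORKING_DAYS_PER_MONTH * MONTHS_WORKED
cost_per_page = labor_cost / PAGES_PROCESSED

remaining_pages = TOTAL_PAGES - PAGES_PROCESSED
months_remaining = remaining_pages / PAGES_PER_MONTH

print(f"Labor cost: ${labor_cost:,.2f}")
print(f"Cost per page: ${cost_per_page:.2f}")
print(f"Months remaining: {months_remaining:.1f}")
```

Under this assumption the labor cost comes to $1.68 per page, which rounds to the $1.70 quoted, and the remaining 14,400 pages would take about 13 months at the quoted monthly rate. The true number of working days may differ, so the result should be read as a plausibility check rather than the authors' own calculation.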
This figure does not include overhead costs such as administration, IT, and benefits, or take into account the 4 hours per day not spent on the project. Chapman suggests the real figure is substantially higher: “the cost per indexed page image accessible on the Internet is approximately seven times higher than the unit cost of scanning and uncorrected OCR.”10 Puglia uses the following rule of thumb when analyzing the costs of digital imaging projects (admittedly for formats other than newspapers): approximately 1/3 of the cost is the digital conversion; slightly less than 1/3 of the cost is in metadata creation, including cataloging, description, and indexing; and slightly more than 1/3 of the cost is in other activities, such as administration and quality control.11
Table 1: Per Page Digitization Costs
| iArchives Scanning/page | iArchives Processing/page | Metadata Markup & Page Segmentation/page | Total Per Page |
|---|---|---|---|
| $0.15 | $0.20 | $1.70 | $2.05 |
CONCLUSIONS
This article describes a possible approach for cultural organizations with limited budgets and staff resources seeking to digitize historical newspapers in their collections. The Library has published seventeen years of the Tundra Times on the Internet. This demonstrates that with the right technologies and sufficient vendor support, a small institution in a remote location with a limited budget can undertake a relatively sophisticated digitization project.
Writing about the newspaper’s place in history in 1988, Tundra Times founder and editor Howard Rock reported:
“When we came off the press for the first time over five years ago . . . the Native people were dead spiritually, it seemed, because no news media would publicize their tragic situations and their problems. The Tundra Times, more than anything else, I think, has awakened the fervor to do something and help to bring out the potential in leadership among our people.”12
The true value of this project is that for the first time users will have full access to a newspaper that contains a record of the events that transformed life for Native Alaskans and led to the political coming-of-age of today’s Alaska Native leaders. The information contained in the newspaper is now available to a new generation of Alaska Native youth and others interested in this period of Alaskan history.
Notes
1 Entlich, Richard, “Where are they now? Digitizing Microfilmed Newspapers,” RLG DigiNews, June 15, 2002, v. 6, no. 3.
2 Deegan, Marilyn, “Digitizing Historic Newspapers: Progress and Prospects,” RLG DigiNews, August 15, 2002, v. 6, no. 4.
3 Arlitsch, Kenning, “The Utah Digital Newspapers Project,” D-Lib Magazine, March 2003, v. 9, no. 3.
4 Viotti, Vicki, “Native language goes online,” Honolulu Advertiser, October 13, 2003.
5 Keegan, Te Taka, Apperley, Mark, Cunningham, Sally-Jo & Ian Witten, “The Niupepa Collection: Opening the Blinds on a Window to the Past,” 2001. Pages 347-356, in ICHIM01 International Cultural Heritage Informatics Meeting Conference, Volume 1 Full Papers. Edited by David Bearman and Franca Garzotto. Politecnico di Milano, Italy.
6 Witten, I. H., McNab, R.J., Boddie, S. J., and D. Bainbridge, “Greenstone: A Comprehensive Open-Source Digital Library Software System,” Pages 113-121, in Proc. Digital Libraries 2000, San Antonio, Texas.
7 Definition of PDF Searchable Image.
8 Arlitsch, Kenning and John Herbert. “Microfilm, Paper and OCR: Issues in Newspaper Digitization, The Utah Digital Newspapers Program,” Microform and Imaging Review, v. 33, no. 2, pp. 59-67.
9 Ibid, p. 63.
10 Chapman, Stephen. “Considerations for Project Management” in Handbook for Digital Projects: A Management Tool for Preservation and Access. Edited by Maxine K. Sitts. Andover, MA, Northeast Document Conservation Center, 2000, p. 32.
11 Puglia, Steven. “The Costs of Digital Imaging Projects,” RLG DigiNews, October 15, 1999, v. 3, no. 5.
12 Morgan, Lael. “Art and Eskimo Power. The Life and Times of Alaskan Howard Rock.” Epicenter Press, Fairbanks, 1988, p. 219.
