RLG
 Contents of: Volume 9, Number 1 ISSN 1093-5371  
  Feature Article 1: The Tundra Times Newspaper Digitization Project  
  Feature Article 2: Building a Globally Distributed Historical Sheet Map Set of Austro-Hungarian Topographic Maps, 1877-1914  
  Conference Report: Archiving Web Resources International Conference: Issues for Cultural Heritage Organisations  
  Highlighted Web Site: DP&I.com  
  FAQ: Getting SMART About Protecting Hard Disk Drives  
  Calendar of Events  
  Announcements  
  RLG News: Descriptive Metadata Guidelines; The EAD Report Card  
  Publishing Information  
 Feature Article 1  

The Tundra Times Newspaper Digitization Project

Authors: Judith A.K. Terpstra - Consultant, Tuzzy Consortium Library (kiwi@alaska.com), Frederick Zarndt - iArchives (Frederick.zarndt@iarchives.com), David Ongley - Tuzzy Consortium Library (David.ongley@tuzzy.org), Stefan Boddie - DL Consulting Limited (stefan@dlconsulting.co.nz)

ABSTRACT

Until recently, most newspaper digitization projects have been the province of large national or state organizations.1,2 Improvements in Optical Character Recognition (OCR) technology, the appearance of service bureaus specializing in the digitization of historical newspapers, and the availability of open-source digital library software have brought costs down and made it both technically and economically feasible for smaller organizations to undertake such projects.

This article describes how the Tuzzy Consortium Library, a small regional library in a very remote location (Barrow, Alaska) successfully undertook the digitization of the Tundra Times, a statewide newspaper that documents the history of Alaska Native peoples and their political struggles from 1962 to 1997.

The Library adopted and extended the methodology developed by the Utah Digital Newspapers Project.3 It departed from the Utah approach in two significant ways. First, the Library used contractor-supplied workflow-processing software that enabled the segmentation and metadata markup of the newspaper articles to be done locally, in Barrow. Second, it used Greenstone, an open-source digital library software suite, rather than a commercial product to publish the newspaper on the Internet.

Figure 1: Tundra Times website

BACKGROUND

The Tundra Times was a statewide, weekly and bi-weekly English-language newspaper whose primary audience was the Alaska Native population. It published from 1962 to 1997—a period of immense change for the Native peoples of Alaska and indeed the state as a whole. The newspaper reported on events that transformed the Native way of life, including settlement of land claims, founding of Native corporations, and the transfer of health and social services to Native-operated nonprofit organizations. The Tundra Times records these struggles for self-determination and compensation from an Alaska Native perspective and, as such, is an important resource for Alaska Natives, Native peoples worldwide, and researchers.

In 1997, after a long financial struggle, the Tundra Times ceased publication. The Ukpeagvik Iñupiat Corporation of Barrow acquired its archives and copyrights, and a year later, the collection was turned over to the Tuzzy Consortium Library. The Library serves as the academic library for Ilisagvik College and the public library for the North Slope Borough (NSB) of Alaska. Bordered on the north by the Arctic Ocean and on the south by the Brooks Range, the NSB is the northernmost organized municipality in the United States, lying entirely above the Arctic Circle. A major part of the Library’s mission is to disseminate collections such as the Tundra Times newspaper directly from Barrow.

In October 2002 the Library received a $150,000 two-year grant from the Institute of Museum and Library Services’ (IMLS) Native American Library Services program to digitize the microfilm of the 35-year run of the Tundra Times and make its articles accessible. Previously, this important cultural and political resource was rarely used since it had never been indexed and was available only on microfilm. Seventeen years of the paper are now Internet accessible.

DIGITIZATION PROCESS

In order to make a newspaper available for searching on the Internet, the following processes must take place: (1) the microfilm copy or paper original is scanned, (2) master and Web image files are generated, (3) metadata is assigned for each issue, page, and article to improve the searchability of the newspaper, (4) OCR software is run over high resolution images to create searchable fulltext, and (5) OCR text, images, and metadata are imported into a digital library software program.

With approximately 27,000 pages to process, one technician dedicated to the project and a tight two-year grant-funded timeframe, it was critical to determine which processes should be done in-house and which should be outsourced. From the outset it was clear that the scanning of the microfilm should be outsourced because this requires expensive and specialized equipment and considerable expertise to obtain the best results. The quality of the images significantly affects the accuracy of the OCR processing so this is a critical step. The service bureau that scans the microfilm normally also generates the derivative Web image files.

We searched for newspaper digitization models that would give us guidance on the remaining steps in the process: OCR processing, selection of digital library software, and assignment of metadata. We looked for approaches that were affordable and could be implemented by a small library with limited technical resources.

During the grant writing process we were fortunate to receive input and support from the Hawaiian Newspaper4 and Maori Newspaper5 projects. Both projects used ABBYY FineReader, a commercial OCR program, and did almost all processing in-house. Initially they experienced a very slow rate of progress due to the complexity of working with historical newspapers, the time needed to train the OCR dictionary to recognize Maori and Hawaiian alphabets, and the great deal of manual error correction that the OCR output required. Both projects used Greenstone open-source digital library software to publish their newspapers on the Internet. Our IMLS grant application was based on their approach.

Since the inception of the Maori and Hawaiian Newspaper projects there has been considerable improvement in OCR technology, and service bureaus have appeared that specialize in generating OCR text from images of historical newspaper articles. The J. Willard Marriott Library at the University of Utah prototyped a viable and affordable approach to newspaper digitization. They used a service bureau to scan and process newspapers to produce image files and XML-tagged metadata and a commercial digital library software program to publish the newspaper on the Internet.3 The Utah model offered clear advantages and prompted us to change direction from the more labor-intensive method used by the Hawaiian and Maori newspaper projects.

NEWSPAPER DIGITIZATION SERVICE BUREAU

We determined that the best approach for the Tundra Times project would be to outsource not only the scanning of the microfilm and generation of the derivative image files, but also the OCR processing and generation of XML files of the OCR text and metadata as these processes could be done more efficiently and less expensively by a service bureau.

The Library contracted these services to iArchives, Inc. They offer a patented digitization and OCR technology that has been used successfully in other newspaper projects, including the University of Utah’s. We selected iArchives as our vendor because they were cost effective, had a proven track record, and were willing to work with our smaller print run.

Use of iArchives’ Web-based workflow management software (WFM) (described in detail later) was an important part of our digitization strategy as it enabled the archival technician to work in Barrow. The WFM also enabled us to meet two other important project goals: retention of local control of the process and provision of training in advanced library technologies to the archival technician.

DIGITAL LIBRARY SOFTWARE

Selection of the digital library software that would provide access to the fulltext of the newspaper and images of articles was a critical step. Many options are available, including Greenstone; the University of Michigan’s DLXS (free or low cost to non-profit institutions); the commercial products CONTENTdm, ENCompass for Digital Collections, and MS Content Management Server; and custom-built solutions.

We determined that Greenstone was the best way for us to publish the OCR text and page images on the Internet. Greenstone was developed by the New Zealand Digital Library Project and is issued under the terms of the GNU General Public License.6 It is a mature, respected, and widely used piece of software that has been under development for over five years and is stable and well supported. Greenstone is designed to be portable to almost any operating system and requires only very modest system resources, meaning that it can usually be installed on existing hardware and doesn’t require a dedicated server. It is highly configurable and the user interface can be customized with relatively little effort. Greenstone has good multilingual support and works well for both searching and displaying non-Latin character sets. Although the Tundra Times is written in English, this facility is important because many personal names, place names, and Native-language terms appear in it.

Greenstone met all our technical requirements, and because it is open-source, we were able to realize considerable savings compared to the cost of commercially available software. We used $2,000 to hire a consultant (DL Consulting Ltd.) to enhance Greenstone’s search and display capabilities and to create the XML plug-in, and anticipate spending an additional $1,000 to resolve issues discovered since loading data.

Greenstone is capable of accepting any number of input file types; however, we confirmed that formatting the OCR output file in XML would provide the most flexibility and portability. Development of the XML plug-in needed to implement this functionality required close cooperation between our digitization vendor and the Greenstone consultant. The XML plug-in stores all the information necessary to enable search term highlighting within the PDF Searchable Image files (formerly PDF Image+Text), a feature users have come to expect.7 Implementing this highlighting required complex customization of the Greenstone runtime system. Unfortunately, this feature works only under Windows with Internet Explorer and Acrobat Reader.

We requested an enhancement allowing searches to be limited by date range and article type, which we felt would benefit users dealing with large amounts of fulltext. We also intend to customize the “look” of the Greenstone interface by using the newspaper masthead instead of the generic Greenstone graphic.

Figure 2: Tundra Times Greenstone Search Interface

METHODOLOGY

Microfilm Scanning

The best microfilm for image projects is an unused, negative, first generation copy of the master. Although the negative was unavailable, we were very fortunate to receive an unused copy of positive microfilm from the Alaska State Library.

iArchives’ microfilm scanning process is largely automated and can include a variety of processes designed to improve image quality such as deskew, dynamic threshold, despeckle, bleed-through reduction, curvature removal, image format conversion, and scaling.

Microfilm rolls are scanned in batches, as the technician completes a work unit. Each roll contains approximately 700 page images, which are scanned at 400 dpi (some earlier images were scanned at a lower resolution). The master image is a cropped, de-skewed, 8-bit grayscale TIFF master scan, averaging 34 MB.

Image processing is always a trade-off. When an algorithm is applied to enhance an aspect of an image or eliminate a defect in a page (for example, bleed-through), the algorithm may at the same time adversely affect some other aspect of the image. Quality control steps are built in at critical points; however, the final quality control occurs when the customer reviews the delivered images.

We did discover variable quality in the microfilm, which naturally affects the quality of the images produced. In one instance the right side of the page is consistently darker than the left, leading one to conclude that a light source for the microfilm scan table was burned out or turned off. Problems such as this occur only very infrequently; therefore, no image correction processes were run on the master scan and Web derivative images. However, a binarization algorithm is run on the 1-bit OCR version of the image, which corrects for a gradient such as this, producing images of a more uniform contrast and brightness. This ensures the image is optimized for the OCR process (the OCR images are discarded after the data has been delivered).
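To illustrate why binarization can correct for an uneven light source, the Python sketch below compares a single global threshold with a local-mean threshold on a small synthetic page whose background darkens from left to right. It is a generic textbook technique shown under stated assumptions, not iArchives’ algorithm; the function names, window size, and offset are all illustrative.

```python
def binarize_global(gray, threshold=128):
    """Mark a pixel black (1) when it is darker than one fixed threshold."""
    return [[1 if value < threshold else 0 for value in row] for row in gray]


def binarize_local(gray, window=2, offset=10):
    """Mark a pixel black (1) when it is darker than the mean of its
    neighbourhood by more than `offset` gray levels."""
    height, width = len(gray), len(gray[0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Neighbourhood around (x, y), clipped at the image borders.
            ys = range(max(0, y - window), min(height, y + window + 1))
            xs = range(max(0, x - window), min(width, x + window + 1))
            values = [gray[j][i] for j in ys for i in xs]
            local_mean = sum(values) / len(values)
            row.append(1 if gray[y][x] < local_mean - offset else 0)
        result.append(row)
    return result


# Synthetic page: the background fades from 220 (left) to 110 (right),
# mimicking a scan that is darker on the right side.
WIDTH = 12
background = [220 - 10 * x for x in range(WIDTH)]
page = [background[:] for _ in range(5)]
page[2][2] -= 60   # "ink" on the bright side (gray level 140)
page[2][9] -= 60   # "ink" on the dark side (gray level 70)

global_result = binarize_global(page)
local_result = binarize_local(page)

print("global:", "".join(str(v) for v in global_result[2]))
print("local: ", "".join(str(v) for v in local_result[2]))
```

On this synthetic page the global threshold misses the ink on the bright side and turns the darkest background columns black, while the local threshold recovers only the two ink pixels.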

Image Formats & Resolutions

A number of decisions needed to be made about the master and Web images including file format (TIFF, JPEG, JPEG2000, or PDF Searchable Image), bit depth (grayscale or bitonal), and resolution. Lack of available tools at the time precluded JPEG2000 from consideration. iArchives provided test images in a variety of formats and resolutions to help us decide among the other options.

iArchives delivered three images derived from the 8-bit grayscale TIFF master scan: (1) 4-bit grayscale, 400dpi TIFF image (average file size 17 MB), (2) 125dpi, 8-bit TIFF images converted to JPEG page and article files for Web delivery (average file size 0.53 MB), and (3) JPEG thumbnails of full pages.

We debated the issue of file size versus image quality at length for the Web versions of the page and article images. When we compared black and white test images to grayscale images it was clear that the grayscale images significantly improved the fidelity of the text and photographs. However, grayscale article and page images are significantly larger than 1-bit black-and-white images (0.53 MB compared to 0.25 MB). We decided that it was more important to have higher quality images than smaller file sizes and chose grayscale images for page and article images for Web delivery. Given the trend to ever-higher communication speeds and increasing availability of broadband, we decided that larger file sizes would not long be a problem.
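The scale of this trade-off can be estimated with simple arithmetic. The sketch below multiplies the average file sizes quoted above by the approximate page count for the full run (27,000 pages). It assumes decimal units (1 GB = 1,000 MB) and treats every page as average-sized, so the results are rough planning figures rather than measured totals.

```python
PAGES = 27_000  # approximate page count for the full run of the newspaper

# Average file sizes in MB, as quoted in the text.
AVERAGE_MB_PER_PAGE = {
    "8-bit grayscale TIFF master scan": 34.0,
    "4-bit grayscale TIFF": 17.0,
    "grayscale image for Web delivery": 0.53,
    "1-bit black-and-white alternative": 0.25,
}


def total_gb(mb_per_page, pages=PAGES):
    """Total storage in GB, assuming 1 GB = 1,000 MB."""
    return mb_per_page * pages / 1000


totals = {name: total_gb(mb) for name, mb in AVERAGE_MB_PER_PAGE.items()}

for name, gigabytes in totals.items():
    print(f"{name:36s} {gigabytes:8.2f} GB")

# Extra Web storage incurred by choosing grayscale over black and white.
extra_web_gb = (totals["grayscale image for Web delivery"]
                - totals["1-bit black-and-white alternative"])
print(f"Extra Web storage for grayscale: {extra_web_gb:.2f} GB")
```

Under these assumptions the grayscale Web images cost only a few additional gigabytes across the whole run, which is small next to the storage needed for the TIFF files.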

Our Web delivery format is Adobe’s PDF Searchable Image file. Each PDF file comprises a JPEG image, OCR text, and select metadata and averages 0.56 MB per page. Adding the text to the PDF adds 3-8% to the file size (“noisy” images generate larger text files and are at the 8% end of the range). Each word is inserted into the file at offsets specified by the OCR text bounding box coordinates in the XML file. Newspaper articles can extend across several pages and all parts of an article are contained in a single PDF file. PDF Searchable Image files are viewable and searchable by Adobe’s Acrobat Reader, a common plug-in that is well supported by Adobe. Packaging larger JPEG files within the PDF ensures the text can be read comfortably onscreen, even by users with visual disabilities (using the Reader’s magnify/zoom option).
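Placing each word at its bounding box requires converting scanner pixel coordinates into PDF page units. The sketch below shows one way this conversion might look. The XML element and attribute names are hypothetical (the vendor’s actual schema is not described here), and the code assumes pixel coordinates measured from the top-left corner of the image and default PDF user space (72 units per inch, origin at the bottom-left).

```python
import xml.etree.ElementTree as ET

# Hypothetical word-level OCR output; the element and attribute names are
# illustrative only and do not reproduce the vendor's real XML schema.
SAMPLE_XML = """
<page width="4000" height="6000" dpi="400">
  <word x="400" y="800" w="600" h="100">Tundra</word>
  <word x="1040" y="800" w="520" h="100">Times</word>
</page>
"""

POINTS_PER_INCH = 72  # default PDF user space unit is 1/72 inch


def word_boxes_in_points(xml_text):
    """Return (text, left, bottom, width, height) in PDF points per word."""
    page = ET.fromstring(xml_text)
    dpi = float(page.get("dpi"))

    def to_points(pixels):
        return pixels * POINTS_PER_INCH / dpi

    page_height_pt = to_points(float(page.get("height")))
    boxes = []
    for word in page.iter("word"):
        x, y, w, h = (float(word.get(key)) for key in ("x", "y", "w", "h"))
        left = to_points(x)
        # Flip the vertical axis: image y grows downward, PDF y grows upward.
        bottom = page_height_pt - to_points(y + h)
        boxes.append((word.text, left, bottom, to_points(w), to_points(h)))
    return boxes


boxes = word_boxes_in_points(SAMPLE_XML)
for text, left, bottom, width, height in boxes:
    print(f"{text}: left={left} bottom={bottom} width={width} height={height}")
```

A PDF writer would then place each word as invisible text at these coordinates over the page image, which is what makes the image searchable.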

Workflow Management System (WFM)

The WFM is the heart of iArchives’ service: it tracks the images, project specifications, metadata, OCR text, and quality aspects associated with multiple projects. Once a customer’s project specifications are determined, the WFM distributes work to computers hosting automated and manual processes. It is this distributed processing model that enables the archival technician to work on the project in Barrow even though the WFM itself resides on a computer in iArchives’ Utah data center. She requests work units (newspaper issues) from the WFM and marks up page level images with article information and metadata. This data is saved locally until the work unit is completed and is then transmitted back to iArchives via FTP.

Figure 3: Tundra Times Workflow

The distributed workflow is scalable and could easily be adapted to enable volunteers to mark up the paper and assign metadata, the most time consuming and therefore most costly components of the digitization process described here. Given the financial constraints many small cultural institutions operate under, this approach could be the key to undertaking a digitization project like the Tundra Times.

Metadata

The archival technician assigns both page level and article level metadata. Each page image is tagged with the following metadata: (1) publication title, (2) publication date, (3) volume and issue number, and (4) page number. The page image is then segmented into articles and the following article level metadata is keyed: (1) headline, (2) byline, (3) classification, and (4) whether the article is a lead story.
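The two levels of metadata described above map naturally onto a pair of record types. The sketch below is one possible representation, with class and field names chosen for illustration; it does not reproduce the WFM’s internal format or its XML output, and the sample values are invented rather than taken from an actual issue.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class Article:
    """Article level metadata keyed after a page is segmented."""
    headline: str
    byline: str
    classification: str
    is_lead_story: bool = False


@dataclass
class Page:
    """Page level metadata, plus the articles zoned on the page."""
    publication_title: str
    publication_date: date
    volume: int
    issue: int
    page_number: int
    articles: List[Article] = field(default_factory=list)


# Invented sample values for illustration only.
page = Page("Tundra Times", date(1970, 1, 7), volume=8, issue=1, page_number=1)
page.articles.append(
    Article("Land Claims Hearing Set", "Staff", "news", is_lead_story=True)
)
page.articles.append(Article("Letters to the Editor", "", "editorial"))

lead_stories = [a.headline for a in page.articles if a.is_lead_story]
print(lead_stories)
```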

Even the best commercially available OCR engines do not achieve 100 percent accuracy and for most projects it is not economically feasible to manually error correct OCR output. Metadata are important because, when combined with the fulltext, they can improve search accuracy and enable searches to be constrained to particular sections of the paper. For example, classifications enable end users to search the editorials, news articles, cartoons, or advertisements, and/or to search for a particular byline.
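The sketch below illustrates both points: a manually keyed headline can rescue an article whose OCR text is garbled, and a classification or byline can narrow the result set. It is a toy in-memory search written for illustration, not Greenstone’s search engine, and the sample records are invented.

```python
# Invented sample records; the first article's OCR text contains typical
# character-recognition errors ("1and c1aims" for "land claims").
ARTICLES = [
    {"headline": "Land Claims Bill Advances", "byline": "Staff",
     "classification": "news",
     "text": "The 1and c1aims bill moved forward this week."},
    {"headline": "Our View", "byline": "Editor",
     "classification": "editorial",
     "text": "The land claims question deserves a fair hearing."},
    {"headline": "Store Hours", "byline": "",
     "classification": "advertisement",
     "text": "Open daily. Land claims forms available."},
]


def search(articles, term, classification=None, byline=None):
    """Search headline plus OCR text, optionally constrained by metadata."""
    term = term.lower()
    hits = []
    for article in articles:
        if classification and article["classification"] != classification:
            continue
        if byline and article["byline"] != byline:
            continue
        searchable = f'{article["headline"]} {article["text"]}'.lower()
        if term in searchable:
            hits.append(article)
    return hits


all_hits = search(ARTICLES, "land claims")
editorial_hits = search(ARTICLES, "land claims", classification="editorial")

print([a["headline"] for a in all_hits])
print([a["headline"] for a in editorial_hits])
```

The first article is found only because its headline was keyed by hand; its uncorrected OCR text alone would not match the query.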

OCR Processing

OCR creates searchable text from digital images. Deegan2 and Arlitsch8 describe the difficult issues associated with OCR processing of newspaper images. Many factors affect the accuracy of OCR. These range from the quality of the source image to the complexity of layouts (many commercial OCR products do not deal well with the complex formats found in newspapers). Smaller fonts, often used in older newspapers, require higher image resolution for optimal OCR performance. Arlitsch concludes that digitizing historic newspapers is a much more complex task than generating text from other documents and states, “nothing but the most robust OCR software should be used for historic newspapers.”9

OCR is run on article images, or more precisely, on a number of rectangular regions that comprise an article. The resulting OCR text is assembled for each article as well as for the entire page. iArchives’ OCR framework employs several of the best commercial OCR engines. The OCR framework assumes that errors made by different OCR engines are weakly correlated; thus, in cases where the OCR engines do not agree on the word found at a particular location (node), the result of each engine is preserved. This technique improves search recall over that of a single OCR engine, especially for low quality images.
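A small sketch can show why preserving every engine’s reading improves recall. The code below is not iArchives’ patented framework: it assumes the engines’ outputs have already been aligned word-for-word by location, which is the hard part in practice, and the engine names and sample readings are invented.

```python
from collections import defaultdict

# Invented outputs from three hypothetical engines for the same four word
# locations (nodes). Each engine makes a different mistake.
ENGINE_OUTPUTS = {
    "engine_a": ["Native", "1and", "claims", "settlement"],
    "engine_b": ["Native", "land", "c1aims", "settlement"],
    "engine_c": ["Nalive", "land", "claims", "settlement"],
}


def merge_nodes(outputs):
    """For each location, keep every distinct reading the engines produced."""
    return [sorted(set(readings)) for readings in zip(*outputs.values())]


def build_index(nodes):
    """Map each lowercased reading to the set of locations where it occurs."""
    index = defaultdict(set)
    for position, readings in enumerate(nodes):
        for reading in readings:
            index[reading.lower()].add(position)
    return index


nodes = merge_nodes(ENGINE_OUTPUTS)
merged_index = build_index(nodes)
single_index = build_index([[word] for word in ENGINE_OUTPUTS["engine_a"]])

print(nodes)
print("'land' found by engine_a alone:", "land" in single_index)
print("'land' found in merged index:  ", "land" in merged_index)
```

Because the three engines err on different words, the merged index contains the correct reading at every location even though no single engine got all four words right.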

The OCR text produced for use by Greenstone or other content delivery software is assembled from these results and may consist of the single most likely result for a location where the OCR engines did not agree, the most likely results, or all results. These results may be further filtered by a stop word list, one or more dictionaries (including non-English languages), or a custom dictionary. For the Tundra Times all results were preserved and no stop word list was used, but a filter was used to reduce “noise” words. iArchives guarantees that the OCR accuracy will always be as good as that of the best commercial engines. They conducted a measured OCR accuracy test for us on two sample images and achieved a 98.4% accuracy rate.

Digital Object Production

The final step in production is the assembly of digital objects comprising the images, OCR text, and metadata gathered by the WFM. Two deliverables are sent to Barrow on DVD: (1) objects for each issue of the newspaper consisting of PDF Searchable Image files containing 125 dpi JPEG images and OCR text, JPEG thumbnails, and XML files describing the location of each word discovered on the page and the metadata gathered during processing and (2) archival 4-bit grayscale TIFF images.

Quality Control

Manual and automated quality control steps are built into the scanning process and the WFM. For example, a missing page report is generated if the numerical sequence indicates a page is missing. iArchives also developed a tool to enable the archival technician to request a replacement image if she encounters a bad image during markup.
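A missing page check of the kind described can be expressed in a few lines. The sketch below is an illustration only, not the WFM’s report: it assumes page numbers have already been keyed as integers and that each issue starts at page 1.

```python
def missing_pages(page_numbers, first_page=1):
    """Return page numbers absent from the sequence first_page..max(page_numbers)."""
    present = set(page_numbers)
    if not present:
        return []
    return [n for n in range(first_page, max(present) + 1) if n not in present]


# Hypothetical issue in which pages 3, 6, and 7 were not captured.
scanned = [1, 2, 4, 5, 8]
report = missing_pages(scanned)
print("Missing pages:", report)
```

A check like this cannot detect pages missing from the end of an issue, since the highest page number seen is taken as the last page.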

Metadata accuracy is somewhat compromised as only one person performs the initial data entry, rekeys it, and then reconciles the differences. Ideally a second operator would rekey the data while a third reconciles any differences. Service bureaus that do re-key work use the following rule of thumb: single key yields ~95% accuracy, double key with second key doing reconciliation yields 99.5% accuracy, and double key with blind reconcile yields 99.95% accuracy. Using this as a guide, we can expect accuracy around 97-99%. To some extent the redundancies inherent in fulltext will mitigate any incorrectly entered metadata; however, end users will experience some reduction in searchability. Such compromises are the unfortunate reality of working in a remote location with limited resources.
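The double-key process amounts to comparing two independent keyings of the same record and reviewing only the fields that differ. The sketch below shows that comparison; the function and field names are illustrative and the sample record is invented.

```python
def reconcile(first_pass, second_pass):
    """Split fields into those where both keyings agree and those in conflict."""
    agreed, conflicts = {}, {}
    for field_name in sorted(first_pass.keys() | second_pass.keys()):
        first_value = first_pass.get(field_name)
        second_value = second_pass.get(field_name)
        if first_value == second_value:
            agreed[field_name] = first_value
        else:
            # A reviewer checks the page image and picks the correct value.
            conflicts[field_name] = (first_value, second_value)
    return agreed, conflicts


# Invented sample: the second keying contains a typo in the byline.
first_pass = {"headline": "Native Leaders Meet in Fairbanks",
              "byline": "Staff", "classification": "news"}
second_pass = {"headline": "Native Leaders Meet in Fairbanks",
               "byline": "Staf", "classification": "news"}

agreed, conflicts = reconcile(first_pass, second_pass)
print("Needs review:", conflicts)
```

The comparison only catches errors that differ between the two passes. When the same person keys both passes, a misreading is more likely to be repeated, which is one reason a second operator improves accuracy.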

The Library contracted with Information Access, a library consulting firm, to assist with project management, provide mentoring and training to the archival technician, and ensure the project met digital collections standards. Additionally, iArchives staff traveled to Barrow to train the archival technician in the use of their software and procedures and provided ongoing telephone support and backup as part of a support contract.

EVALUATION

The project recently completed its second year. The Library was able to meet the majority of its project objectives by blending the best solutions from other digitization projects to develop a viable and cost effective method for digitizing a historical newspaper and making it available over the Internet.

However, disappointingly, we were able to publish only 17 of the 35 years of the newspaper within the two-year grant period. We have identified three main reasons for this: insufficient staff, a delayed project start date, and an increase in the number of pages to be processed.

The original proposal called for the technician to receive part time assistance from students from Ilisagvik College and, during summer vacations, two library science students were to come to Barrow to work on the project. However, for various reasons, neither option came to pass. This significantly impacted the production output. The technician averages only 4 hours a day on the project, both because the work is very detailed (it is not possible to work for longer than this without losing accuracy) and because she has other responsibilities.

Secondly, the project start date was delayed for almost a year, due to adoption of the Utah model as opposed to the more manual process outlined in the grant application and the time needed to determine project specifications and negotiate contracts with project vendors. Actual processing began in September 2003. Finally, once we received the rolls of microfilm we revised the estimate of the number of pages to be processed from 20,000 to 27,000.

IMLS has granted an extension until September 2005. We project their funding will be exhausted by April 2005; however, another funding source has been identified and is sufficient to complete the project. Given the current rate of progress an additional 13 months is required to complete the project.

What has worked well, given constraints of budget, staffing, and location, is asking all external vendors to provide training and support. This provided critical backup and continuity and enabled the project to be undertaken in Barrow using local staff.

As with any project we have encountered obstacles and setbacks. Solving IT infrastructure problems has been time consuming because the Library does not have a dedicated IT person. Dealing with multiple vendors and non-traditional software has stretched college IT staff and required good will on both sides to resolve issues in a timely way.

Sometimes the ramifications of a decision are not clear until the data are loaded and searching begins. For example, the header section of each page (containing the page number and date) had been given a default classification of “unclassified”. When the first issues were loaded it became apparent that these unclassified non-content bearing images were cluttering the result set display. iArchives added the term “Header” to the classification list and the Greenstone plug-in was edited so that all the headers could be excluded from the result set display.

Now that sufficient data have been loaded we have been able to test Greenstone more thoroughly. We have discovered several small glitches, such as broken thumbnail links and mistyped metadata, as well as things we would like to improve, such as listing each year separately in the browse section. A more serious problem emerged when we realized that the search term highlighting was not working properly. When search terms were too long (in fact, whenever they were longer than very short single terms such as “marine”) the URLs became too long for Internet Explorer to handle, causing search term highlighting to fail. A simple edit to change the form method from “get” to “post” solved the problem.
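The difference between the two form methods is where the parameters travel: a “get” request encodes them in the URL, while a “post” request sends them in the request body. The sketch below illustrates the effect against Internet Explorer’s documented maximum URL length of 2,083 characters. The base URL and parameter names are hypothetical, as is the assumption that highlighting data (one coordinate set per matched word) accompanies the query; the actual parameters Greenstone sent are not described here.

```python
from urllib.parse import urlencode

IE_MAX_URL_LENGTH = 2083  # Internet Explorer's documented URL length limit
BASE_URL = "http://example.org/cgi-bin/library"  # placeholder address


def build_get_url(params):
    """With method "get", every parameter is appended to the URL."""
    return f"{BASE_URL}?{urlencode(params)}"


def build_post_request(params):
    """With method "post", the URL stays short and parameters go in the body."""
    return BASE_URL, urlencode(params).encode("ascii")


def fits_in_ie_url(url):
    return len(url) <= IE_MAX_URL_LENGTH


# Hypothetical parameters: "q" is the query, "hl" carries highlight boxes.
short_params = {"q": "marine", "hl": ",".join(["120,340,60,18"] * 10)}
long_params = {"q": "marine mammal commission",
               "hl": ",".join(["120,340,60,18"] * 200)}

short_url = build_get_url(short_params)
long_url = build_get_url(long_params)
post_url, post_body = build_post_request(long_params)

print(len(short_url), fits_in_ie_url(short_url))
print(len(long_url), fits_in_ie_url(long_url))
print(len(post_url), fits_in_ie_url(post_url), len(post_body))
```

Under these assumptions the long query overflows the limit as a “get” URL, while the same data sent by “post” leaves the URL unchanged in length.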

iArchives has added functionality to their software over the project period causing changes to the way the paper is processed. For example, in order to deal with hyphenation that frequently occurs in newspapers due to narrow column widths, the OCR software can now join word fragments so they are searchable. Incorporating this functionality part way through the project meant that the text is not generated in a consistent way, negatively impacting searching and retrieval.
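The general idea of joining word fragments can be sketched as follows. This is a naive illustration, not iArchives’ implementation: it assumes the OCR text is available line by line and that any word ending a line with a hyphen continues on the next line.

```python
def join_hyphenated(lines):
    """Return the words of a column of text, rejoining words split across lines."""
    words = []
    carry = ""  # fragment left over from the end of the previous line
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        if carry:
            tokens[0] = carry + tokens[0]
            carry = ""
        if tokens[-1].endswith("-") and len(tokens[-1]) > 1:
            # Hold the fragment (minus its hyphen) until the next line arrives.
            carry = tokens.pop()[:-1]
        words.extend(tokens)
    if carry:
        # No following line: keep the fragment as it appeared in the text.
        words.append(carry + "-")
    return words


column = ["The regional corpo-", "ration was founded", "after the settle-", "ment."]
joined = join_hyphenated(column)
print(joined)
```

A rule this simple also removes legitimate hyphens when a compound such as “self-determination” happens to break at its hyphen, so a production system would likely check candidates against a dictionary before joining them.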

COSTS

Newspaper scanning and processing costs are detailed in Table 1. Project costs vary based on the specific processing decisions made. We partnered with iArchives to pilot two processes, external use of their Web-based workflow management software and development of the XML plug-in for Greenstone digital library software, and therefore received a discounted rate on processing costs. iArchives’ processing fee includes image processing, OCR, use of the WFM, and image delivery on DVD. Forty hours of support were provided for an additional $2,500.

Article zoning and metadata markup are the most labor-intensive and, therefore, most costly parts of the process. The Library’s technician performed these tasks. Her $21 hourly rate is the minimum wage in Barrow, a remote community where living costs are very high. An alternative to doing the metadata markup and page segmentation in house would have been to outsource these processes too. iArchives provided an initial cost estimate of $1.50 per page to enter article level metadata. They are able to offer such a relatively low per-page cost because they in turn outsource this work to overseas contractors who use the WFM to process the page images.

Given the high cost of doing business in Barrow and the fact that the sole technician is only able to work 4 hours a day on the project, using local staff for these tasks was not the most cost effective solution. However, we felt it was important for indigenous cultural and intellectual rights to be respected, and using local staff met other important project goals such as retention of local control, ensuring the economic benefit and skills remained in the community, and providing training and educational opportunities to project staff.

Our costs are based on the total number of pages processed by iArchives by December 2004 (12,600) and on the time the technician actually spent on the project: 12 months rather than the 15 months elapsed (because of vacation and delays in receiving work units), at 4 hours a day rather than 8. The technician processed approximately 57 pages a day, averaging 1,100 pages per month. This gives a cost of approximately $1.70 per page. In the future we will use the built-in time tracker in the WFM to generate more precise cost figures.

This figure does not include overhead costs such as administration, IT, and benefits, or take into account the 4 hours per day not spent on the project. Chapman suggests the real figure is substantially higher: “the cost per indexed page image accessible on the Internet is approximately seven times higher than the unit cost of scanning and uncorrected OCR.”10 Puglia uses the following rule of thumb when analyzing the costs of digital imaging projects (admittedly for formats other than newspapers): approximately 1/3 of the cost is the digital conversion; slightly less than 1/3 of the cost is in metadata creation, including cataloging, description, and indexing; and slightly more than 1/3 of the cost is in other activities, such as administration and quality control.11

Table 1: Per Page Digitization Costs

Cost component                             Cost per page
iArchives scanning                         $0.15
iArchives processing                       $0.20
Metadata markup and page segmentation      $1.70
Total                                      $2.05
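The per page figures in Table 1 can be combined with the page counts reported in the article to give rough direct-cost totals. The sketch below does this arithmetic in whole cents to avoid rounding errors. It excludes overhead, the support contract, and consulting fees, and the comparison with the $1.50 outsourcing estimate assumes that estimate would cover the same markup work, which the article does not confirm.

```python
# Per page costs from Table 1, in cents.
SCANNING_CENTS = 15
PROCESSING_CENTS = 20
LOCAL_MARKUP_CENTS = 170

# Vendor's initial estimate for outsourced article level metadata entry.
OUTSOURCED_MARKUP_CENTS = 150

PAGES_PROCESSED = 12_600   # pages processed by December 2004
PAGES_FULL_RUN = 27_000    # revised estimate for the whole newspaper


def per_page_cents(markup_cents):
    return SCANNING_CENTS + PROCESSING_CENTS + markup_cents


def direct_cost_dollars(pages, markup_cents=LOCAL_MARKUP_CENTS):
    return pages * per_page_cents(markup_cents) / 100


local_per_page = per_page_cents(LOCAL_MARKUP_CENTS) / 100
cost_to_date = direct_cost_dollars(PAGES_PROCESSED)
cost_full_run = direct_cost_dollars(PAGES_FULL_RUN)
cost_full_run_outsourced = direct_cost_dollars(PAGES_FULL_RUN, OUTSOURCED_MARKUP_CENTS)

print(f"Per page (local markup):     ${local_per_page:.2f}")
print(f"Direct cost to date:         ${cost_to_date:,.2f}")
print(f"Direct cost, full run:       ${cost_full_run:,.2f}")
print(f"Full run, outsourced markup: ${cost_full_run_outsourced:,.2f}")
```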

CONCLUSIONS

This article describes a possible approach for cultural organizations with limited budgets and staff resources seeking to digitize historical newspapers in their collections. The Library has published seventeen years of the Tundra Times on the Internet. This demonstrates that with the right technologies and sufficient vendor support, a small institution in a remote location with a limited budget can undertake a relatively sophisticated digitization project.

Writing about the newspaper’s place in history in 1988, Tundra Times founder and editor Howard Rock reported:

“When we came off the press for the first time over five years ago . . . the Native people were dead spiritually, it seemed, because no news media would publicize their tragic situations and their problems. The Tundra Times, more than anything else, I think, has awakened the fervor to do something and help to bring out the potential in leadership among our people.”12

The true value of this project is that for the first time users will have full access to a newspaper that contains a record of the events that transformed life for Native Alaskans and led to the political coming-of-age of today’s Alaska Native leaders. The information contained in the newspaper is now available to a new generation of Alaska Native youth and others interested in this period of Alaskan history.

Notes

1 Entlich, Richard, “Where are they now? Digitizing Microfilmed Newspapers,” RLG DigiNews, June 15, 2002, v. 6, no. 3.

2 Deegan, Marilyn, “Digitizing Historic Newspapers: Progress and Prospects,” RLG DigiNews, August 15, 2002, v. 6, no. 4.

3 Arlitsch, Kenning, “The Utah Digital Newspapers Project,” D-Lib Magazine, March 2003, v. 9, no. 3.

4 Viotti, Vicki, “Native language goes online,” Honolulu Advertiser, October 13, 2003.

5 Keegan, Te Taka, Apperley, Mark, Cunningham, Sally-Jo & Ian Witten, “The Niupepa Collection: Opening the Blinds on a Window to the Past,” 2001. Pages 347-356, in ICHIM01 International Cultural Heritage Informatics Meeting Conference, Volume 1 Full Papers. Edited by David Bearman and Franca Garzotto. Politecnico di Milano, Italy.

6 Witten, I. H., McNab, R.J., Boddie, S. J., and D. Bainbridge, “Greenstone: A Comprehensive Open-Source Digital Library Software System,” Pages 113-121, in Proc. Digital Libraries 2000, San Antonio, Texas.

7 Definition of PDF Searchable Image.

8 Arlitsch, Kenning and John Herbert. “Microfilm, Paper and OCR: Issues in Newspaper Digitization, The Utah Digital Newspapers Program,” Microform and Imaging Review, v. 33, no. 2, pp. 59-67.

9 Ibid., p. 63.

10 Chapman, Stephen. “Considerations for Project Management” in Handbook for Digital Projects: A Management Tool for Preservation and Access. Edited by Maxine K. Sitts. Andover, MA, Northeast Document Conservation Center, 2000, p. 32.

11 Puglia, Steven. “The Costs of Digital Imaging Projects,” RLG DigiNews, October 15, 1999, v. 3, no. 5.

12 Morgan, Lael. “Art and Eskimo Power. The Life and Times of Alaskan Howard Rock.” Epicenter Press, Fairbanks, 1988, p. 219.


 Feature Article 2  

Building a Globally Distributed Historical Sheet Map Set of Austro-Hungarian Topographic Maps, 1877-1914

Author: Patrick McGlamery - University of Connecticut (Patrick.McGlamery@uconn.edu)

Introduction

Working with sets of topographic maps has always been difficult for scholars, geographers, and librarians. Books are linear: page 2 generally follows page 1, once we get past the messy i, ii, and iii. Maps, on the other hand, are multi-dimensional, set out like tiles on a floor in a Cartesian grid, but the content within changes over time. Consider the sheets depicting Warsaw, whose content changes over a hundred years. The scholar has these sheets spread out over the floor and follows a road north to Gdansk (or is it Danzig?) and moves to an adjacent sheet. Is it in the collection? Is it for the same date? Due to political and military upheaval, map collections for this region of Europe are often dispersed, landing in collections and libraries around the world.

Scholars refer to these time and space representations to recreate a world gone by but captured cartographically. Shtetls long emptied, battlefields forgotten, farms now developed. Maps pick up annotations, troop movements, “X marks the spot,” plans for highways, canals, new settlements, boundaries and borders. Librarians are challenged to bring this globally dispersed collection together, virtually, for international scholars and the general public.

“AuHu75” and the IMLS Research and Development Grant

Recently, the University of Connecticut’s Homer Babbidge Library was awarded a National Leadership Grant for Libraries from the Institute of Museum and Library Services (IMLS) for Building a Globally Distributed Historical Sheet Map Set. In partnership with the New York Public Library and the American Geographical Society’s Map Library at the University of Wisconsin, Milwaukee, the University of Connecticut Libraries has committed to developing a model tool that will bring together globally dispersed sheet map collections. The aim of this project is to create an international, metadata-driven, dynamic access tool that will enable users to find and view scanned and geo-referenced images from 1877-1914 Austro-Hungarian topographic maps (AuHu75) by querying an easy-to-use digital gazetteer.

Building Digital Collections: People

The design for this project grew out of discussions at various conferences for map librarians. Seven libraries in the United States and Europe are participating in the model design. In addition to the New York Public Library, the American Geographical Society Map Library, and the University of Connecticut’s Homer Babbidge Library, four European libraries are cooperating, though not yet funded: Oxford University’s Bodleian Library; and the national libraries of Denmark (the Kongelige Bibliotek), Slovenia (Narodna in Univerzitetna Knjiznica), and Croatia (Nacionalna I Sveucilisna Knjiznica).

Maps are Cartesian representations of the Earth’s surface in a standard cartographic symbology that is embraced world-wide. In the United States, federal direction has come from the Federal Interagency Coordinating Committee on Digital Cartography in 1983 and the Federal Geographic Data Committee (FGDC) in 1990. The National Academy of Sciences’ Mapping Science Committee (MSC) serves as a focus for external advice to federal agencies on scientific and technical matters related to spatial data handling and analysis, supporting the development of a robust national spatial data infrastructure for making informed decisions at all levels of government and throughout society in general. The concept of a National Spatial Data Infrastructure (NSDI) was first advanced by the MSC in its 1993 report, Toward a Coordinated Spatial Data Infrastructure for the Nation, and implemented in 1994 by Executive Order 12906, “Coordinating Geographic Data Acquisition and Access: The National Spatial Data Infrastructure.” Subsequent MSC reports have addressed specific components of the NSDI including partnerships (Promoting the National Spatial Data Infrastructure Through Partnerships, 1994), basic data types (A Data Foundation for the National Spatial Data Infrastructure, 1995), and future trends (The Future of Spatial Data and Society, 1997). On June 15, 1998, the MSC convened a workshop, entitled Distributed GeoLibraries, to explore:

  • a vision for geospatial data dissemination and access in 2010
  • comparisons of different efforts in digital library research, clearinghouse development, and other data distribution
  • suggestions of short and long term research needs
  • identification of policy and institutional issues

That workshop recognized the strong, integral relationship between libraries and the geo-spatial community. Map librarians have been engaged with spatial metadata since its inception in 1992. They have used Geographic Information System (GIS) software since 1990, sharing training, insights, and tips.

Key to this project is engaging champions who understand the changing nature, not only of mapping, GIS, and cartography, but of the role of libraries as storehouses of scientific data, albeit often in analog format. By scanning and geo-referencing the map images, map libraries position themselves as integral partners in spatial information research. Each of the project libraries has been using GIS for up to a decade at varying levels. Most have ArcGIS software through a generous relationship with ESRI, a GIS software provider.

Building Digital Collections: Technology

U.S. libraries began engaging in GIS around 1990 in response to the Bureau of the Census’ TIGER Files, the geo-spatial data that began MapQuest and its ilk. The perfect storm of vast amounts of large numeric data files, CD-ROM as a cheap and convenient alternative to 9-track tape, and a unique partnership between ESRI and libraries fostered by the Association of Research Libraries came together to establish a corps of educated librarians with dedicated software (ArcView), high-powered hardware, spatial geodata, and a user community demanding the technology. Since the 1990s, the technology has moved from CD to network; metadata tools have improved; and the hardware dollar buys more power. ESRI’s ArcGIS is hard-coded to create a metadata record for any database the user creates and uses that metadata as a core component. With twelve years of maturity, metadata has become the linchpin for data sharing in the GIS community.

The project model uses existing GIS technologies, in particular ESRI’s ArcIMS (Internet Map Server) Metadata Server and ESRI’s Geography Network geodatabase protocol, as well as existing cataloging/metadata programs, such as FGDC’s Spatial Metadata Training programs.

Building Digital Collections: Content

The map set selected for this model, with its focus on the Austro-Hungarian Empire, was determined at a meeting of map librarians during the 64th Congress of the International Federation of Library Associations and Institutions (IFLA) in Boston and confirmed at a subsequent meeting of map librarians at the International Conference on the History of Cartography in Boston and Portland, Maine in 2003. Over 1 million people living in the Austro-Hungarian Empire immigrated to the United States in the period 1820 to 1880 and over 4 million during the 1880 to 1930 period. Descendants of these immigrants are part of the World Wide Web genealogical community and are heavy map library users. One of the partners of this grant is the Latter Day Saints Church Family History Library. The Church created and sponsors the FamilySearch database of ancestors, a core tool for the genealogical user community. Current estimates of use show 176,000 potential views per month that might concern Austro-Hungary, or about 5,866 views per day. Within the scope of this project, FamilySearch intends to integrate access to the scanned maps enabling a lifelong learner to locate a named person, and if their place of birth or death is given, to link to an image of the historical map, zooming into the historical location of the birth or death event. The named person’s records with dated place name will drive the metadata query, selecting the most temporally and spatially appropriate scanned map images from among the partner libraries.

The AuHu75 map set was published in Vienna over a period of years, 1877-1914. The set comprises 776 sheets, with all sheets from all editions numbering over 3,665. Each map represents a ‘tile’ of 15 minutes of latitude by 30 minutes of longitude (about 1,000 sq. kilometers) and comes in several editions and print dates. The maps detail the physical and cultural landscape, showing hills, valleys and rivers, as well as houses, mills, factories, and farms. The cartography represents deciduous and coniferous trees differently and shows natural forests and planting, like the trees lining the stream in the figure below. The variant place names are also clearly evident. Other map sets published before, during, and after WWII used the Spezialkarte der Österreichisch-ungarischen Monarchie as a base map, making the AuHu75 a key map for studying a large and significant area of Central Europe in the 19th and 20th centuries. As a set of maps, it enjoys an enormous, sophisticated user community made up of historians, genealogists, archaeologists, architects, and lawyers researching WWII reparations claims.

sample map

There are few comprehensive collections of the AuHu75 maps and each of the collections has a distinguished provenance. The delegates to the League of Nations demarcated the boundaries after WWI using sheets from the AGS’ collection. Map sheets held by Slovenia and Croatia are part of their national heritage and reflect their unique nature with annotations and marginalia. The New York Public Library’s collection serves a community of citizens and scholars and works integrally with other collections at NYPL, such as those found in the Slavic & Baltic, Jewish and U.S. History, Local History and Genealogy divisions and the archival collections of the YIVO Institute for Jewish Research. The University of Connecticut acquired its map sheets from the Library of Congress, which got them from the U.S. Department of Defense via the U.S. Army, which captured them from the German Army, which captured them from the Russian Army. The AGS holds 1,000 sheets, the NYPL 776 sheets, and University of Connecticut’s Map and Geographic Information Center (MAGIC) 542 sheets, for a total of 2,318 images. Clearly other libraries such as the Library of Congress, the British Library, and the National Libraries of Austria and Hungary have large and significant collections. The common thread is that no library holds all of the permutations of the map sheets, especially as they were used as documents to codify political, military, or diplomatic decisions. Bringing various map sheets together virtually from a global network of libraries will add value to each distinct library collection.

Distributed Network

As noted, the goal of this project is to design an international, metadata-driven, dynamic access tool that will enable users to access scanned and geo-referenced map images by querying an easy-to-use digital gazetteer. The combined holdings of AGS, NYPL, and UConn MAGIC will be “geo-referenced” to create a single digital gazetteer for locating place-names on the map images. The gazetteer will include both current and historic names referenced to the same geographic location. An Internet Map Server (ArcIMS) will index the metadata from the cooperating sites and provide interfaces to the imagery, “zooming” to the queried place. The interface will bring together the appropriate image from one or more of the libraries, based on date and place-names. The project will also investigate image data compression, geo-referencing, merging metadata and digital gazetteers, new GIS applications, and cooperative collection building and sharing.
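The selection of "the appropriate image ... based on date and place-names" can be illustrated with a small sketch. This is not the project's ArcIMS implementation; the `Sheet` record, its field names, and the sample holdings are hypothetical, and the rule used here (closest date among the sheets that cover the place) is only one plausible reading of "most appropriate."

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sheet:
    """Hypothetical sheet-level record; the field names are illustrative only."""
    library: str
    west: float   # bounding longitude, decimal degrees east of Greenwich
    east: float
    south: float  # bounding latitude, decimal degrees
    north: float
    year: int     # edition or print date


def covers(sheet: Sheet, lon: float, lat: float) -> bool:
    """True when the point falls inside the sheet's bounding rectangle."""
    return sheet.west <= lon <= sheet.east and sheet.south <= lat <= sheet.north


def best_sheet(sheets: List[Sheet], lon: float, lat: float, year: int) -> Optional[Sheet]:
    """Among sheets covering the place, pick the one dated closest to the event year."""
    candidates = [s for s in sheets if covers(s, lon, lat)]
    if not candidates:
        return None
    return min(candidates, key=lambda s: abs(s.year - year))


# Made-up holdings: two editions of one tile plus a neighbouring tile.
holdings = [
    Sheet("AGS", 20.0, 20.5, 49.0, 49.25, 1880),
    Sheet("NYPL", 20.0, 20.5, 49.0, 49.25, 1909),
    Sheet("MAGIC", 20.5, 21.0, 49.0, 49.25, 1895),
]
match = best_sheet(holdings, lon=20.2, lat=49.1, year=1905)
print(match.library if match else "no sheet found")
```

In this made-up example the 1909 sheet is returned for a 1905 event, because it covers the place and is nearer in date than the 1880 edition.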

AuHu75: How-To

Step 1 – Importing Data

The first step involves acquiring images of historical paper maps using a roll scanner with a resolution of 300 dots per inch (DPI). Images are saved as grayscale TIFF images.

Step 2 – Coordinate Conversion

In step 2 the Degrees-Minutes-Seconds (DMS) coordinates of the original maps are converted to Decimal Degrees (DD) to six decimal places. Contemporary maps use 0° longitude, the Prime Meridian, running through Greenwich, England, as the origin for east-west measures. The Austro-Hungarian maps instead rely on the Ferro Line as their prime meridian. Ferro lies 17.662778° west of Greenwich, so the maps’ origin must be shifted east by that amount; in practice, 17.662778° is subtracted from each longitude measured east of Ferro. This ensures that, once geo-rectified, the Austro-Hungarian maps will be compatible with Greenwich-based latitude and longitude measurements.
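The arithmetic of this step can be sketched in a few lines of Python. The function names are illustrative, not part of any project tool. The sketch rounds to six decimal places, matching the 17.662778° offset quoted above, and assumes longitudes on the sheets are measured east of Ferro; because Ferro lies west of Greenwich, the offset is subtracted.

```python
FERRO_OFFSET_DEG = 17.662778  # Ferro's distance west of Greenwich (value from the text)


def dms_to_dd(degrees: int, minutes: int, seconds: float = 0.0) -> float:
    """Convert non-negative Degrees-Minutes-Seconds to Decimal Degrees (six decimals)."""
    return round(degrees + minutes / 60 + seconds / 3600, 6)


def ferro_to_greenwich(lon_ferro_dd: float) -> float:
    """Re-express a longitude east of Ferro as a longitude east of Greenwich."""
    return round(lon_ferro_dd - FERRO_OFFSET_DEG, 6)


# A hypothetical sheet edge at 38°30′ east of Ferro.
lon_ferro = dms_to_dd(38, 30, 0)
lon_greenwich = ferro_to_greenwich(lon_ferro)
print(lon_ferro, lon_greenwich)
```

As a check on the offset itself, 17°39′46″ converts to 17.662778 decimal degrees.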

Step 3 – Geo-Rectification

This step transforms the planimetric raster images of the Austro-Hungarian maps to an image that is correctly projected to the surface of the earth. The geo-referencing tool within ArcMap will be used to perform an affine transformation from the X,Y grid coordinates of the map image to the X,Y coordinates of the chosen projection. All Austro-Hungarian maps will be projected to the World Geodetic System 1984 (WGS84) using latitude and longitude.

Twelve Ground Control Points (GCPs, shown as red crosses) are chosen for each of the Austro-Hungarian maps. Using the coordinates from Step 2, the coordinate table is updated. Once geo-rectified, the images’ geometry changes and they look elongated. The images are then compressed into two file formats, MrSID and JP2 (JPEG 2000). The TIFF map image files, scanned at 300 dpi, average 24 megabytes and are compressed to about 1.2 megabytes, a ratio of roughly 20:1.
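To show what an affine transformation fitted to ground control points involves, here is a minimal pure-Python sketch. It is not ArcMap's code or API; it solves the least-squares normal equations for the six affine coefficients. The scan size, the sheet corners, and the use of only four GCPs (rather than twelve) are assumptions made for brevity.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def _solve3(m: List[List[float]], v: List[float]) -> List[float]:
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    a = [row[:] + [rhs] for row, rhs in zip(m, v)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, 3):
            factor = a[r][col] / a[col][col]
            for c in range(col, 4):
                a[r][c] -= factor * a[col][c]
    x = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        x[i] = (a[i][3] - sum(a[i][j] * x[j] for j in range(i + 1, 3))) / a[i][i]
    return x


def fit_affine(pixels: Sequence[Point], world: Sequence[Point]):
    """Least-squares fit of lon = a*col + b*row + c and lat = d*col + e*row + f."""
    # Normal equations (A^T A) p = A^T t, where each row of A is [col, row, 1].
    rows = [[px, py, 1.0] for px, py in pixels]
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    coeffs = []
    for k in (0, 1):  # 0 = longitude, 1 = latitude
        atb = [sum(r[i] * w[k] for r, w in zip(rows, world)) for i in range(3)]
        coeffs.append(_solve3(ata, atb))
    return coeffs[0], coeffs[1]


def apply_affine(coeffs, px: float, py: float) -> Point:
    """Map a pixel (column, row) to (longitude, latitude)."""
    (a, b, c), (d, e, f) = coeffs
    return a * px + b * py + c, d * px + e * py + f


# Hypothetical 6000 x 4500 pixel scan; GCPs at the four neat-line corners.
gcp_pixels = [(0, 0), (6000, 0), (0, 4500), (6000, 4500)]
gcp_world = [(20.0, 49.25), (20.5, 49.25), (20.0, 49.0), (20.5, 49.0)]
coeffs = fit_affine(gcp_pixels, gcp_world)
lon, lat = apply_affine(coeffs, 3000, 2250)  # centre of the scan
print(round(lon, 6), round(lat, 6))
```

With more than three well-spread control points the fit is over-determined, so small errors in individual GCPs are averaged rather than reproduced exactly. The sketch will fail with a division by zero if the points are collinear.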

Step 4 – Creating Bounding Rectangles

Using ArcView, this step creates bounding rectangles to remove the white margin map borders. These geo-referenced map images are then combined to create a mosaic of the entire area.
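The bounding rectangles and the extent of the resulting mosaic can be sketched as follows. This is not ArcView code. It assumes each sheet is identified by its south-west corner in Greenwich-based decimal degrees and uses the 15-minute by 30-minute sheet size given for the set; the corner values are made up.

```python
from typing import Iterable, NamedTuple


class BBox(NamedTuple):
    west: float
    south: float
    east: float
    north: float


SHEET_HEIGHT_DEG = 15 / 60  # 15 minutes of latitude
SHEET_WIDTH_DEG = 30 / 60   # 30 minutes of longitude


def sheet_bbox(west: float, south: float) -> BBox:
    """Bounding rectangle of one sheet, given its south-west corner."""
    return BBox(west, south, west + SHEET_WIDTH_DEG, south + SHEET_HEIGHT_DEG)


def mosaic_extent(boxes: Iterable[BBox]) -> BBox:
    """Smallest rectangle that contains every sheet in the mosaic."""
    boxes = list(boxes)
    return BBox(
        min(b.west for b in boxes),
        min(b.south for b in boxes),
        max(b.east for b in boxes),
        max(b.north for b in boxes),
    )


# Three adjacent hypothetical sheets.
tiles = [sheet_bbox(20.0, 49.0), sheet_bbox(20.5, 49.0), sheet_bbox(20.0, 49.25)]
extent = mosaic_extent(tiles)
print(extent)
```

Clipping each image to its rectangle is what lets neighbouring sheets abut without their white margins overlapping.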






Step 5 – Metadata

GEODEX (GEOgraphic InDEX System for Map Series) is a key part of the metadata creation process. GEODEX was developed by Christopher Baruth of the American Geographical Society Library in the mid-1980s to inventory and access the AGS Library’s map sheets in series. It provides for fast geographic (latitude/longitude) searching, by point or area, and rapid input of data into the system. With the exception of the sheet name, number, and a brief optional note, all other data fields are coded and fall into two categories: fixed fields and open fields. In addition to holdings, the fixed fields record that the map is part of a series; that it is a topographical map using hachures to show elevation; that it is a monochrome printed map; that its format is a standard quadrangle; that the quadrangle’s dimensions are 15 x 30 minutes; that it is drawn on the Polyhedric projection; that it uses the Ferro, not the Greenwich, prime meridian; and that its scale is 1:75,000. A GEODEX record can also contain up to five open fields that record contour intervals, editions, and various date types.

Currently there are 347,729 GEODEX records from some 169 multi-sheet map sets. This level of sheet control provides a significant savings in labor for the AuHu75 Project. GEODEX was written in QBasic, a 1980s programming language that is not readily accessible. Cross-walking the GEODEX database to the FGDC Content Standard for Digital Geospatial Metadata will enable libraries to add their sheet-level holdings to an international union collection.

MetaLite, a public domain metadata creation tool written in Visual Basic 8.0, is being rewritten as MetaLite for Librarians (ML4ML) with the capacity to import and cross-walk GEODEX-formatted metadata. This step will use ML4ML to select a sheet record from GEODEX, expressed in the FGDC metadata content standard, and then edit it to describe the “map in hand.” The geo-referenced image with the edited FGDC metadata record is then opened in ArcCatalog to insert technical and geo-spatial metadata and to register it in ArcGIS.
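A cross-walk of this kind might look roughly like the sketch below. The GEODEX record shown as a Python dictionary is hypothetical (the actual GEODEX field layout is not described here), and this is not ML4ML code. The output uses short element names from the FGDC Content Standard for Digital Geospatial Metadata (`idinfo`, `citeinfo`, `spdom`, `bounding`, `westbc`, and so on) but is only a fragment, not a complete or validated FGDC record.

```python
import xml.etree.ElementTree as ET


def geodex_to_fgdc(record: dict) -> ET.Element:
    """Build a partial FGDC-style metadata fragment from a sheet-level record."""
    metadata = ET.Element("metadata")
    idinfo = ET.SubElement(metadata, "idinfo")

    # Citation: publication date and a title assembled from series and sheet.
    citeinfo = ET.SubElement(ET.SubElement(idinfo, "citation"), "citeinfo")
    ET.SubElement(citeinfo, "pubdate").text = str(record["year"])
    ET.SubElement(citeinfo, "title").text = (
        f"{record['series']}, sheet {record['sheet_number']}: {record['sheet_name']}"
    )

    # Spatial domain: bounding coordinates in decimal degrees.
    bounding = ET.SubElement(ET.SubElement(idinfo, "spdom"), "bounding")
    for tag, key in (("westbc", "west"), ("eastbc", "east"),
                     ("northbc", "north"), ("southbc", "south")):
        ET.SubElement(bounding, tag).text = f"{record[key]:.6f}"
    return metadata


# Hypothetical record; keys and values are illustrative only.
sample = {
    "series": "AuHu75",
    "sheet_number": "0000",
    "sheet_name": "Example sheet",
    "year": 1895,
    "west": 20.0, "east": 20.5, "north": 49.25, "south": 49.0,
}
fragment = geodex_to_fgdc(sample)
print(ET.tostring(fragment, encoding="unicode"))
```

A librarian would then edit such a generated record to describe the particular sheet in hand, as the text describes.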



Step 6 – Creating the Place-Name Point Layer

The AuHu75 maps a polyglot Empire. German, Hungarian, Polish, Czech, and other place-names are represented on the maps. For the genealogist it is always a challenge to navigate names, often aurally remembered and handed down in English, Yiddish, or even “Brooklynese.” This project will dynamically link the place-name to its spot on the map, using the U.S. Department of Defense’s Gazetteer. The names retain the toponymic diacritics relevant to their languages. The Gazetteer’s drawback is that its coordinates go only to the Degree-Minute level, lacking Seconds, so place-name locations are ‘rounded’ to the nearest Minute. This approach will get the scholar close to the name on the map, but it is less than optimal.

The initial step is to convert degrees and minutes to decimal degrees: DDDMM to DD.DDD. Then, once the spatial extent of the Austro-Hungarian Empire has been determined, a spatial SQL query is used to extract and clip the place-name points that fall within the Empire.
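Both operations can be sketched in Python. The conversion follows the DDDMM to DD.DDD pattern given above. The clipping uses an ordinary SQL range query in an in-memory SQLite database as a stand-in for a true spatial SQL query. The bounding values for the Empire and the three sample place-names are rough illustrative figures, not data from the Department of Defense gazetteer.

```python
import sqlite3
from typing import List, Tuple


def dddmm_to_dd(dddmm: int) -> float:
    """Convert an integer in DDDMM form (degrees*100 + minutes) to decimal degrees."""
    sign = -1 if dddmm < 0 else 1
    degrees, minutes = divmod(abs(dddmm), 100)
    if minutes >= 60:
        raise ValueError(f"invalid minutes in {dddmm}")
    return sign * round(degrees + minutes / 60, 3)


def clip_places(raw: List[Tuple[str, int, int]],
                west: float, south: float, east: float, north: float) -> List[str]:
    """Return names of places whose converted coordinates fall inside the rectangle."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE places (name TEXT, lat REAL, lon REAL)")
    conn.executemany(
        "INSERT INTO places VALUES (?, ?, ?)",
        [(name, dddmm_to_dd(lat), dddmm_to_dd(lon)) for name, lat, lon in raw],
    )
    rows = conn.execute(
        "SELECT name FROM places "
        "WHERE lat BETWEEN ? AND ? AND lon BETWEEN ? AND ? ORDER BY name",
        (south, north, west, east),
    ).fetchall()
    conn.close()
    return [name for (name,) in rows]


# (name, latitude DDMM, longitude DDDMM); approximate, for illustration only.
gazetteer = [("Wien", 4813, 1622), ("Lemberg", 4950, 2401), ("Paris", 4852, 220)]

# Rough, assumed rectangle around the Empire (decimal degrees).
inside = clip_places(gazetteer, west=9.5, south=42.0, east=26.5, north=51.0)
print(inside)
```

A rectangle is a crude stand-in for the Empire's actual outline; a real spatial query would test each point against a boundary polygon instead.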






Step 7 – Building the interface

Maps scanned and provided on the World Wide Web can take advantage of the inherent qualities of cartographic information in ways not possible with text files. Maps have four integral components: Cartesian logic, temporal nature, semantic features, and dense quantitative information content. These components provide opportunities for creating a dynamic Web-based union list of distributed maps by building metadata that specify the latitude/longitude and the date of the map and by utilizing gazetteers as linking mechanisms between the place-name and the map image.

The map librarians involved in this project understand the unique nature of cartographic information in its analog state but more importantly as digital objects. They have worked closely with the user community to provide map sheets that are typically not cataloged at the sheet level, that represent space in a Cartesian system not based on Greenwich, come with place-names in German, Polish, Hungarian, Czech, Slovenian, and Croatian, but not English, and represent an Empire whose internal boundaries were in constant flux. The typical reference encounter for a map librarian with map sheets and gazetteers at hand for the Austro-Hungarian Empire is a one to two hour consultation.

ArcGIS is a software program that can transform a scanned digital surrogate of a map into a geo-spatial object. This object can then respond to other geo-spatial datasets, such as a digital gazetteer. ArcGIS is a product of ESRI (Environmental Systems Research Institute), a GIS software developer that has been supporting library programs for a decade through training (in cooperation with ARL), gifts of software, and resources for meetings. The ESRI software includes an Internet map server, ArcIMS. ArcIMS Metadata Server will serve as the engine that provides for the selection of the appropriate sheet or sheets based on the spatial, temporal, toponymic, and language constraints of the query. The ESRI White Paper, “Implementing a Metadata Catalog Portal in a GIS Network,” presents a framework for building open, interoperable GIS catalog portals and discusses the capabilities of ArcGIS and how these tools can be used in the critical aspects of building, serving, and using GIS catalogs in Spatial Data Infrastructure.

Based on metadata and sound cataloging concepts, map sharing can be expanded in the future from the AuHu75 map set to include any scanned maps, whether geo-referenced through the tools available to the librarian in ArcMap or provided simply as un-geo-referenced images. During the course of the grant we will re-code the public domain program MetaLite, a metadata creation tool created by the US Geological Survey and the United Nations, to import two types of database records, MARC and GEODEX. A prototype of the AuHu75 can be found on MAGIC's website. Enter the place-name Rad at the pushpin “Locate Address” icon (11th down on the right).

The User Community

The Church of Latter Day Saints’ FamilySearch database supports and advises an international community of users. Genealogists make use of the FamilySearch database and also of hundreds of libraries across the country and around the world that provide free access to microfilm and to databases. A map engages a beginning researcher in the process of understanding how places play a role in retrieving genealogical information. Many beginners might not understand the geographical context of their ancestor’s life. They do not realize that different life events may have occurred in different locations and hence records must be looked for in different places. Maps also identify contiguous localities where information might be found on families of siblings and in-laws. Maps represent an essential part of the process of envisioning an ancestor’s life in terms of the places they lived and the location of records that pertain to them. Although maps are of great value to all researchers, they are particularly important in helping a beginner get started.

Maps with jurisdictional boundaries are significantly more important than maps that just show places. Even if the person was born, married, and died in the same place, the recording of those events might be found in different places because genealogical records are created by jurisdictional entities. In many cases these entities store the records in places other than the place where the ancestor resided. Jurisdictional boundaries also predict where one might find the records stored today. For instance, records for pieces of a historical province now belonging to several modern provinces might readily be found in the capital city of the historical province and not the current provincial center.

Summary

The AuHu75 provides an exciting initial project. The material is in the public domain; the sheets are black & white, and uniform in size with a consistent projection, prime meridian and coordinate system; and the AGS has cataloged many of the sheets in their GEODEX program. Using available tools, documentation, and dispersed holdings, this project will provide a powerful search, query, and discovery tool at the historical ancestor level.


 Conference Report  

Archiving Web Resources International Conference: Issues for Cultural Heritage Organisations

Author: Margaret E. Phillips - National Library of Australia (mphillips@nla.gov.au)

In November 2004, the National Library of Australia hosted the Archiving Web Resources International Conference. Over 200 delegates from 21 countries, representing libraries, archives, museums, galleries, and government agencies, gathered to hear papers and to discuss the issues for cultural heritage organisations.

The first part of the conference described the current state of usage of the Web for the distribution of cultural and documentary materials, and looked at the reasons for the need to archive and preserve them. Keynote speaker Dr Malcolm Gillies, Deputy Vice-Chancellor (Education) at the Australian National University, set the scene and outlined the reasons for archiving and preserving. If steps are not taken to archive and preserve Web materials, much of our cultural heritage will simply disappear and, with it, the current and future material of research.

Gillies described the timeline of human cultural transmission. We have moved through the stages from oral transmission of information to manuscripts, printing, and now digital formats. He likened the use of digital formats to a return to oral transmission as there is no physical residue, and significant materials may exist in a single copy on a website. Because of the nature of the material, its survival is dependent on conscious archival management, without which it will certainly disappear or, at best, become unusable.

Gillies’ paper was followed by a number of speakers with particular areas of expertise: science, spatial data, contemporary culture (blogs and wikis), and digital art. Their presentations demonstrated the predominance of the Web for information creation and dissemination in all sectors and highlighted the size, complexity, dynamic nature, and dependence on technology that are characteristic of the contents of Web archives. The challenges that these would pose for cultural institutions were apparent.

On the second day of the conference, a number of speakers representing organisations which have already commenced Web archiving programs addressed the question ‘what to collect and how to do it?’ Comprehensive, selective, and thematic or subject based approaches to archiving were described, as well as a risk management approach, Virtual Remote Control. All of these approaches have advantages as well as disadvantages. There is now a trend for a Web archiving program to supplement its principal approach with another approach, for instance, to supplement whole domain harvesting with some more targeted selective archiving, as is the case in Sweden, or to supplement selective archiving with periodic whole domain harvests, as the United Kingdom plans to do.

The need for co-operation among agencies involved in Web archiving was an overriding theme of the conference. A number of speakers expressed the need to collaborate and to aim for a network of interoperable archival institutions that would provide for the user seamless access to a global archived digital resource.

Julien Masanès from the Bibliothèque nationale de France and coordinator of the International Internet Preservation Consortium (IIPC) outlined the work of the Consortium, which has emerged as a major avenue for co-operation by national libraries and the Internet Archive since its inception in July 2003. The aim of the IIPC is to develop tools and standards that will help establish an international network of archived resources. Speakers highlighted other examples of co-operation already underway, including TEL (The European Library), NESTOR (Network of Expertise in Long-Term Storage of Digital Resources for Germany), the UK Web Archiving Consortium, NDIIPP (National Digital Information Infrastructure and Preservation Program) in the United States, and PANDORA, Australia’s Web Archive.

Managing collections of Web resources involves making them accessible through good resource discovery metadata and keeping them accessible through preservation metadata and preservation strategies. Tom Delsey pointed out that given the volume of data involved, it may no longer be possible to handcraft metadata as libraries have done in the past, and it may be necessary to rely more heavily on automated indexing processes.

Keeping Web resources accessible as the software required to display them changes depends on access to information about the formats in which they were created. Presentations on the Global Digital Format Registry and PRONOM described these format registries, intended to store detailed information about formats, together with detailed representation information that will enable archives to take appropriate action to ensure that the digital objects remain accessible in the future.

Speakers from national libraries emphasised the importance of legal deposit legislation to support collecting and providing access to Web resources. The United Kingdom and New Zealand have recently enacted legislation which extends the concept of legal deposit to Web resources. Those countries which do not yet have such laws, such as the United States and Australia, are obliged to seek the permission of publishers before archiving and providing access, a time-consuming activity.

A number of speakers referred to the fact that collecting and preserving Web resources is a costly activity, which must receive the appropriate funding. Jim Michalko from RLG (Research Libraries Group) noted that because our experience with the preservation of these materials is still limited, it is impossible to estimate what the actual costs will be. Abby Smith from CLIR (Council on Library and Information Resources) pointed out that archiving Web materials is a public good. While governments are aware of libraries’ and archives’ roles and budget requirements for collecting traditional materials, they do not always understand the scale and complexity of Web archiving.

Two governments of countries represented at the conference have made realistic financial commitments to national Web archiving programs. The US government has funded NDIIPP, a program to preserve significant digital content at risk, with a grant of US$100 million, of which $75 million has to be matched dollar for dollar. The New Zealand government has allocated NZ$24 million for the establishment of a digital preservation program.

Despite the daunting scale of the task ahead, there was optimism for the future. The Web archiving programs established in the past decade have ‘learned by doing,’ and there is now expertise to share in forums such as this conference. There is also the will and the precedents for collaboration towards shared goals.

Information about speakers and copies of most presentations are available on the conference website.

As a satellite event to the conference, an Information Day was held. This was a day for the practitioners, rather than the policy makers and the theorists, although they were welcome too. A variety of tools and methods required by agencies that are setting up or conducting Web archiving programs were demonstrated. There were 13 presentations on tools such as the Heritrix Crawler, the New Zealand Metadata Extractor, and the Xena tool for electronic normalising of archives. Speakers’ presentations are available.


 Highlighted Web Site  

DP&I.com

DP&I: A Digital Printing and Imaging Resource

DP&I presents “an online information resource for photographers, digital and traditional artists, printmakers, art educators, art marketers, and anyone interested in the digital printing and imaging revolution.” The site features an up-to-date news archive, product reviews, several “how-tos,” and a number of essays devoted to digital images and art. The left-hand navigation list also includes numerous directories of links for suppliers, organizations, message forums, publications, and events. The site should be a good resource for institutions working with physical, digitized, and especially born-digital objects of art.

The website is kept by Harald Johnson, consultant and author of Digital Printing Start-Up Guide and Mastering Digital Printing. He is also the founder and a past moderator of the active message group Digital Fine Art Mailing List.


 FAQ  

Getting SMART About Protecting Hard Disk Drives

Author: Richard Entlich - Cornell University (rge1@cornell.edu)

Our data is backed up, but hard disks have gotten so big that recovering from even a single drive failure is an ordeal. Is there any way to help manage the risk?

Most components of the modern personal computer have undergone dramatic improvements in performance and capacity in the decades since its introduction. CPU clock speeds have increased by four orders of magnitude (ten thousand times) and memory density by six orders of magnitude (a million times).

Although such improvements in electronics are truly impressive, the fact that hard disk drives have kept up a similar pace is nothing short of miraculous. Over the past 15 years, 3.5" drives have seen a four orders of magnitude increase in storage capacity, from roughly 40 MB to 400 GB. Yet these are no motionless slabs of silicon. They are sophisticated electro-mechanical devices built to incredibly tight tolerances. Platters now spin at up to 15,000 RPM (revolutions per minute) with tiny read/write heads floating on a sub-micron (millionth of a meter) cushion of air. Head assemblies flash across the platters, retrieving tightly packed data in a matter of milliseconds.

Given their sensitive mechanical nature, one might expect hard drive malfunctions to be commonplace. Though they do have problems more frequently than computer components without moving parts, modern hard drives are remarkably reliable. Reported failure rates run from under 1% to 3% per year. Nevertheless, with the amount of data stored on a single drive expanding rapidly, any drive failure looms as an onerous and undesirable event. Failures can't always be prevented or predicted, so backups are essential. However, there are steps that can extend the service life of drives as well as provide warning when a drive is starting to ail. We'll discuss both routine (and not so routine) maintenance as well as a widely available but frequently underutilized failure prediction technology called SMART.

Figure: An internal view of a hard disk drive, with the major components labeled. The heads are attached at the very tip of the narrow end of the actuator arms.

Maintenance

Software maintenance

Not long ago, lists of computer maintenance tips were topped by suggestions to regularly run a hard disk scanner like Windows’ ScanDisk or Norton Disk Doctor and to use a defragmenter to improve drive performance. These days, users’ attention is more likely to be focused on downloading the latest virus definitions, running anti-virus scans, and removing spyware from their machines. This seems to be not only because various forms of “malware” are thought of as a more serious threat to the integrity of computers and their data, but because hard disk drives have become more robust.

Both anti-virus software and disk scan software help to keep the hard drive functioning as expected, but there is a fundamental distinction. Even though a virus may affect important data (such as operating system components) stored on the hard drive and even prevent a machine from booting, it generally does not physically damage the hardware. Disk scans, on the other hand, can detect areas of the disk that have been damaged and prevent them from being used, thereby restoring the integrity of the device.

Thus, while maintaining a virus- and spyware-free system is extremely important for proper hard drive (and computer) operation, other steps should still be taken to protect the integrity of data on the hard drive. Most modern operating systems include utilities that can repair corrupt directories and some, like Windows XP, also remap bad sectors and optimize file organization without user intervention. There are also third party utilities such as Norton SystemWorks and Gibson Research’s SpinRite that can help retain or restore health to a hard disk suffering from defects in the magnetic media. Defragmentation (the process of rearranging files on a disk so that all portions are physically contiguous rather than scattered across the disk) not only improves performance but also saves some wear and tear on the drive's head actuator.

Hardware maintenance

The notion of hardware maintenance of a hard disk drive may seem a bit odd. After all, there’s nothing to clean, lubricate, or replace. Hard drives have no “user serviceable parts,” and if you did manage to get inside, that act would almost assuredly destroy it. But it is practical and prudent to control the environment in which a hard drive operates.

Temperature

All hard drives are designed to operate within a limited range of temperatures, and today’s drives, with their high speed spindle motors, generate a fair amount of heat. A typical operating range might be from 40-130° F (5-55° C). Overheating affects the longevity and reliability of all electronic components, but it can also shorten the life of mechanical components such as motor bearings and their lubricants. The higher the ambient temperature, the harder time the computer’s cooling system will have keeping the drive within its normal operating range. A computer in direct sun in a hot room may need supplemental cooling to maintain internal temperatures within a safe operating range. However, even drives in machines operated in air conditioned spaces can have their life expectancy extended through supplemental cooling (usually done by adding fans to increase air flow across the drive).

An often overlooked source of overheating is accumulated dust. Years ago, when internal upgrades to memory and other subsystems were fairly common, computer cases were opened frequently, and the dust problem was evident. Modern machines run hotter (due to faster CPUs, high performance video cards, and higher memory capacity, among other things), but are less likely to be scrutinized by users or tech support personnel on a regular basis. They should be checked periodically (perhaps once or twice a year, but more frequently for machines operated in very dusty environments) for accumulated dust around fan intake and exhaust ports. A heavy layer of dust on internal components acts like insulation and should also be removed.

Laptop computers are even more prone to overheating, due to their densely packed components and the difficulty establishing effective internal cooling. About.com has compiled a useful list of steps that laptop users can take to reduce overheating and prolong the life of components.

Not mentioned in the list is the fact that laptop computers are also more likely to spend significant amounts of time in unheated areas (though desktops can also get cold during shipment). Any computer that has been in temperatures below about 40° F for an extended period should be allowed to warm up before being used. If the computer is in a carrying case, it should remain there until nearly fully warmed. This will prevent damage to the hard drive from condensation and thermal shock from too rapid a temperature change. Additionally, though users tend to think of their laptops as “go anywhere” computers, standard laptop hard drives are not meant to be used in below freezing temperatures.

Altitude

Though often thought of as sealed units, most hard disk drives are minimally open to the environment so they can maintain a pressure equilibrium with the outside world. There has to be sufficient air density to provide lift for the drive's heads to float above the platters. Thus most drives cannot be used at altitudes over about 10,000 ft (3000 m) above sea level.1

Vibration and shock

Hard drives are most vulnerable to shock damage before they are mounted within a computer case. Once installed in a system, modern drives are fairly resistant to damage from ordinary bumps and vibrations. This is especially true of laptop drives, which are expected to get knocked about more. However, any drive can be damaged from an extreme force, especially while it is operating, and particularly when it is reading or writing data.

Electrical power

Efforts should be made to provide clean power and minimize power surges and interruptions. Surges can damage drive components; interruptions that occur while the drive is writing data can result in corrupted data and directory entries. It’s always a good idea to run a disk scan utility after a power failure or forced power down from a computer crash. Many operating systems do this automatically if they detect that the computer was previously not shut down normally.

Reliability indicators

Despite attentive maintenance and careful physical handling, no hard drive will last forever. Two approaches can lessen the likelihood of facing an emergency drive recovery or a restoration from backup.

The numbers game

One approach is to replace drives before they reach the end of their service life. Of course, in order to do so, one has to be able to estimate how long a drive is intended to last. Data provided by manufacturers can seem confusing, as several measures of life expectancy are usually provided: MTBF (Mean Time Between Failures), start/stop cycles, and warranty period.

Unfortunately, relying on these measures has major drawbacks. MTBF figures provided for most drives are so large (in the hundreds of thousands or even millions of hours) as to be meaningless. Given the rapid obsolescence of computing technology, being told that, on average, a drive will last the equivalent of 50 or 100 years isn't very helpful. Similarly, the rated number of start/stop cycles (usually in the tens of thousands) would seem to presage an extremely long life, even for drives that are power cycled or put in sleep mode several times each day.
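The arithmetic behind these figures can be sketched in a few lines of Python. This is only an illustration: the MTBF ratings are made-up round numbers, and the failure-rate estimate assumes a constant failure rate over the drive's life (an exponential model), which is our simplifying assumption rather than anything a manufacturer publishes.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def mtbf_to_years(mtbf_hours: float) -> float:
    """Express an MTBF rating as years of continuous operation."""
    return mtbf_hours / HOURS_PER_YEAR


def annualized_failure_rate(mtbf_hours: float,
                            hours_per_year: float = HOURS_PER_YEAR) -> float:
    """Chance of failing within one year of use.

    Assumes a constant failure rate (exponential model); this is a
    simplification made for the sketch, not a manufacturer's formula.
    """
    return 1.0 - math.exp(-hours_per_year / mtbf_hours)


# Illustrative MTBF ratings, in hours (hypothetical round numbers).
for mtbf in (500_000, 1_000_000):
    print(f"{mtbf:>9,} h = {mtbf_to_years(mtbf):5.1f} years, "
          f"annual failure rate ~ {annualized_failure_rate(mtbf):.2%}")
```

Read this way, an MTBF in the hundreds of thousands of hours says less about how long one drive will last than about what fraction of a large population of drives might fail in a given year.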

On the surface, warranty figures would seem more reliable as predictors of life expectancy, since the manufacturer has considerable motivation to cover a drive no longer than it thinks it will typically function without failing. Warranties on low-end consumer grade IDE/ATA drives have slipped to just one year. Higher quality drives tend to have warranties of three to five years, but even those can be deceiving. If the profit margin on a high-end drive is sufficient, a longer warranty may be feasible for the manufacturer to offer, even if a somewhat higher percentage of drives will need replacement under warranty.

Therein lies the problem with all these measures—they are based on statistical averages and are influenced by marketing factors. They might be somewhat useful in making purchase and upgrade decisions across a department or large organization. However, they ignore (or average out) natural variations in manufactured quality as well as differences in service life between drives that spend most of their time idling and ones that thrash constantly, like those in many servers. They are not helpful in estimating the health of any particular drive, and that's where SMART comes in.

SMART technology

SMART (which stands for Self-Monitoring, Analysis and Reporting Technology) is a standard developed by disk drive manufacturers to allow a hard disk drive to do a self-diagnostic and report back the results. SMART is based on the notion that certain device failures (particularly mechanical ones) occur gradually, so that sufficient time is available to warn of an impending failure, allowing the drive's contents to be moved to a new replacement under non-emergency conditions.

SMART equipped drives, which include most ATA/IDE and SCSI drives manufactured in the past decade or so, are able to monitor performance parameters such as spindle motor spin-up time, error rates and retry counts, environmental factors such as internal drive temperature, and actual usage such as total power-on hours and stop/start cycle counts. This data is then used to judge the health of the drive. For example, if the time it takes a drive’s spindle motor to reach normal operating speed starts to increase, it could indicate a problem with the motor's bearings. When the drive's behavior begins to deviate sufficiently from the expected norm, but before it's so bad as to make the drive unusable, SMART can issue a warning with a recommendation to stop using it. The drive may continue to function, but the user is under advisement that delay in replacing the drive may be hazardous to their data.

Estimates are that 20% to 60% of drive failures are associated with parameters that can be monitored by SMART. Other types of failures resulting from electronic component breakdown, bad solder joints, power surges, or sudden physical shock are usually instantaneous and thus impossible to predict.

Why have drive manufacturers provided this sophisticated self-diagnostic capability? In part, it’s to recognize the important role that the data stored on hard drives has in the operation of most enterprises and the lives of most individuals, but the motivation is not entirely altruistic. When a computer misbehaves, the hard drive is often the first component to be suspected. Manufacturers report that up to 40% of drives returned for repair or replacement under warranty are functioning normally (designated “NPF” for “no problem found”). This costs them a lot of money and often an unwarranted loss of faith in their product by the user.

If carefully implemented so it doesn’t generate too many false positives (i.e., reporting an impending drive failure when none is forthcoming) or false negatives (missing a real impending failure), SMART has the potential to reduce erroneous returns by letting the user know that whatever seems wrong with the computer, a flaw in the hard drive’s operation is not at fault. It also becomes a useful tool for diagnosing an ailing computer.

Putting SMART into practice

SMART can be implemented at a couple of different levels. The most rudimentary is at the BIOS level. Assuming SMART monitoring is enabled in the computer’s BIOS (the settings of which are generally accessible during boot-up by pressing a key such as delete, escape, or F2), the drive's self-diagnostic will be conducted at start-up time and a user alert generated if one of the monitored criteria falls out of range. Such an alert would only be generated if SMART has detected either a failing subsystem (meaning data should be immediately backed up and the drive replaced) or that the drive has exceeded one of its service life parameters such as total power on time or start/stop cycles.

A potentially more fruitful but much less commonly adopted approach is to install SMART compatible software that can regularly query the drive’s health and run a variety of diagnostic tests. Frequently such software is provided with a new hard disk drive, but not installed by the user.

The kinds of diagnostic reports that a drive can generate, as well as the kinds of tests that can be run, vary by manufacturer and model. The original SMART specification required drives to report a table of attributes, the monitoring of which over time can provide early clues to a drive’s deteriorating health. Although the current SMART specification only requires drives to report whether they have passed or failed a battery of tests, most still support the reporting of attributes.

In most cases, the drive reports a normalized value (between 0 and 253) for each attribute and this is compared to a threshold value. If the normalized value drops below the threshold value, then the drive is considered to have gone into failure mode, or to have exceeded a physical operating parameter like temperature, or to have exceeded its life expectancy. The actual value for each attribute that the manufacturer uses as the failure threshold is, in most cases, considered a trade secret and thus not revealed. However, actual values for attributes such as power on hours, start/stop cycles, and temperature are usually provided.
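The comparison described above can be sketched as follows. The class and function names are hypothetical, the readings are illustrative, and the `inclusive` switch reflects our understanding that some tools also treat a value equal to the threshold as a failure.

```python
from dataclasses import dataclass


@dataclass
class SmartAttribute:
    name: str
    value: int      # current normalized value
    worst: int      # worst normalized value the drive has recorded
    threshold: int  # manufacturer's failure threshold


def assess(attr: SmartAttribute, inclusive: bool = False) -> str:
    """Classify an attribute by comparing normalized values to the threshold.

    With inclusive=False a value must drop below the threshold to count as
    failed, as described in the text; inclusive=True also counts equality.
    """
    def failed(v: int) -> bool:
        return v <= attr.threshold if inclusive else v < attr.threshold

    if failed(attr.value):
        return "failing now"
    if failed(attr.worst):
        return "failed in the past"
    return "ok"


def margin(attr: SmartAttribute) -> int:
    """Headroom between the current normalized value and the threshold."""
    return attr.value - attr.threshold


# Illustrative readings; the second attribute is invented for the example.
seek = SmartAttribute("Seek_Error_Rate", value=71, worst=60, threshold=30)
worn = SmartAttribute("Hypothetical_Attribute", value=25, worst=25, threshold=30)

for attr in (seek, worn):
    print(f"{attr.name}: {assess(attr)} (margin {margin(attr)})")
```

Tracking the margin over time, rather than waiting for the threshold to be crossed, is one way to spot the kind of gradual deterioration that SMART is meant to reveal.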

SMART software

Pretty much every major disk drive manufacturer makes SMART compatible software available for free from its website. A partial list of such packages can be found at the SMART Linux site. These tools vary substantially in design and capability. Some are designed to be run from a bootable floppy disk in command line mode (e.g., under DOS) on the assumption that the software might be needed because the hard drive is malfunctioning. Obviously the computer has to have a floppy drive in order to use these. Others have graphical user interfaces. Most are capable of simply reading the drive’s SMART status and running either a short or long drive function test (the actual time required varies considerably depending on the drive's capacity). SMART software made by drive manufacturers should be paired with that company’s own hardware. Check the website for compatibility with specific models. Manufacturer-specific software will often provide a special diagnostic code if the drive fails a SMART test that can then be used to get the drive serviced or replaced under warranty.

There are also generic SMART packages designed to work with a variety of disk drives from different manufacturers. Some of these are also listed on the SMART Linux site. One of the most popular is the open source smartmontools. Originally derived from the Linux smartsuite package, smartmontools is now available for most versions of Linux, Unix (including MacOS X), and Windows. Smartmontools consists of two utilities that can provide on-demand or scheduled monitoring and testing of hard drive health. Smartmontools allows fine tuning of reporting such as setting a threshold value for drive temperature that you consider worthy of action (which may be considerably below what the manufacturer uses as its threshold) and allowing alerts to be emailed. Smartmontools can be used on any SMART enabled drive (though it provides data beyond the baseline SMART report for drives it already knows about) and can see drives behind some RAID controllers.2
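The two utilities are smartctl, for on-demand queries and tests, and smartd, a background monitoring daemon. For scripted checks, smartctl reports its findings partly through its exit status, which is a set of bit flags. The sketch below decodes those flags; the bit meanings are paraphrased from our reading of the smartctl manual page and should be checked against the installed version, and the function names and the device path in the comment are hypothetical.

```python
import subprocess

# Meaning of each bit in smartctl's exit status, paraphrased from the
# smartctl manual page (verify against your installed version).
EXIT_BITS = {
    0: "command line did not parse",
    1: "device could not be opened",
    2: "a SMART command to the disk failed",
    3: "SMART status reports the disk is failing",
    4: "a pre-fail attribute is at or below its threshold",
    5: "an attribute was at or below its threshold in the past",
    6: "the device error log contains errors",
    7: "the self-test log contains errors",
}


def decode_exit_status(status: int) -> list[str]:
    """Return a description for every bit set in the exit status."""
    return [msg for bit, msg in EXIT_BITS.items() if status & (1 << bit)]


def check_drive(device: str) -> list[str]:
    """Run a health check (-H) and attribute dump (-A) on one device.

    Requires smartmontools to be installed and usually administrator
    rights, so it is defined here but not called.
    """
    result = subprocess.run(["smartctl", "-H", "-A", device],
                            capture_output=True, text=True)
    return decode_exit_status(result.returncode)


# Example call on a real system (device name is an assumption):
#     problems = check_drive("/dev/sda")
print(decode_exit_status(0b00011000))  # bits 3 and 4 set
```

An empty list means smartctl found nothing to report; anything else is a cue to look at the full output before deciding whether to replace the drive.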

ID#  ATTRIBUTE_NAME          VALUE  WORST  THRESHOLD  TYPE      WHEN FAILED  RAW VALUE
  1  Raw_Read_Error_Rate     068    067    006        Pre-fail  -            190439331
  3  Spin_Up_Time            098    097    000        Pre-fail  -            0
  4  Start_Stop_Count        100    100    020        Old_age   -            235
  5  Reallocated_Sector_Ct   100    100    036        Pre-fail  -            0
  7  Seek_Error_Rate         071    060    030        Pre-fail  -            14310365
  9  Power_On_Hours          098    098    000        Old_age   -            2080
 10  Spin_Retry_Count        100    100    097        Pre-fail  -            0
 12  Power_Cycle_Count       100    100    020        Old_age   -            627
194  Temperature_Celsius     036    040    000        Old_age   -            36
195  Hardware_ECC_Recovered  068    067    000        Old_age   -            190439331
197  Current_Pending_Sector  100    100    000        Old_age   -            0
198  Offline_Uncorrectable   100    100    000        Old_age   -            0
199  UDMA_CRC_Error_Count    200    200    000        Old_age   -            0
200  Multi_Zone_Error_Rate   100    253    000        Old_age   -            0
202  TA_Increase_Count       100    253    000        Old_age   -            0

An edited SMART attribute table from smartmontools. VALUE refers to the current normalized value for the attribute, WORST is the worst value ever recorded for that attribute, and THRESHOLD is the value for that attribute that the manufacturer has deemed a failure condition or predictor. The raw values, where available, are shown in the last column. For example, this drive has seen 2080 power-on hours. Note that there is no way to tell how many power-on hours the manufacturer considers to be too many, but with SMART activated, an alert would be generated when that amount of use was reached. The particular attributes measured are also vendor specific and vary from drive to drive.
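A table like this one can be turned into structured records and checked against a limit of your own choosing, such as a drive temperature you consider worth acting on. The sketch below assumes the eight-column layout of the edited table; actual smartctl output carries additional columns and sometimes extra text in the raw value, so the parser would need adjusting. The function names and the sample temperature limits are hypothetical.

```python
# Three rows in the eight-column layout of the edited table above.
SAMPLE = """\
  1 Raw_Read_Error_Rate  068 067 006 Pre-fail - 190439331
  9 Power_On_Hours       098 098 000 Old_age  - 2080
194 Temperature_Celsius  036 040 000 Old_age  - 36
"""


def parse_rows(text: str) -> dict:
    """Parse whitespace-separated attribute rows into a dict keyed by name."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip headers, blank lines, and rows that do not fit the layout.
        if len(fields) != 8 or not fields[0].isdigit():
            continue
        ident, name, value, worst, thresh, kind, when_failed, raw = fields
        rows[name] = {
            "id": int(ident),
            "value": int(value),
            "worst": int(worst),
            "threshold": int(thresh),
            "type": kind,
            "when_failed": when_failed,
            "raw": int(raw),
        }
    return rows


def too_hot(rows: dict, limit_celsius: int) -> bool:
    """True if the drive reports a raw temperature at or above our own limit."""
    temp = rows.get("Temperature_Celsius")
    return temp is not None and temp["raw"] >= limit_celsius


rows = parse_rows(SAMPLE)
print("Power-on hours:", rows["Power_On_Hours"]["raw"])
print("Above a 45 C limit?", too_hot(rows, 45))
```

Because not every drive has a temperature sensor, the check returns False rather than raising an error when the attribute is missing.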


Should you use SMART?

SMART technology has some shortcomings that might cause users to question its value. Though the basic interface and means for reporting SMART data are standardized, the particular attributes measured and the values used as thresholds for failure are entirely up to the manufacturer. SMART was designed to handle up to 30 different attributes, but most drives report far fewer. The ability to report temperature, for example, depends on the existence of a temperature sensor within the drive, which not all models offer.

Also, some critics of the technology dispute its effectiveness. Anecdotal evidence is cited to show that SMART may only provide alerts for 10% to 20% of drive failures, and in some cases, 0%. Whether or not these reports are representative, there does seem to be a lack of large-scale, real-life testing of just how effective SMART is in predicting drive failure and avoiding catastrophic loss of data.

However, there also doesn't seem to be much of a downside to using SMART, as long as one is aware of its limitations. The software is largely free and easy to install and use. With individual hard disk drives of half a terabyte now available and multi-terabyte server arrays commonplace, it makes sense to take steps to lessen the possibility of data and productivity loss associated with the unexpected failure of such mammoth storage devices. Attention to environmental conditions and use of SMART monitoring are two common sense strategies for reducing risk of sudden hard drive failure.

Notes

1 This limitation doesn't apply to use in the passenger cabins of airplanes, even though cruising altitudes often exceed 30,000 feet. Passenger compartments are pressurized to maintain an environment equivalent to an altitude no higher than about 8,000 feet above sea level.

2 RAID (for Redundant Array of Independent [or Inexpensive] Disks) is a popular mechanism for increasing hard drive data reliability through redundancy and error correction beyond that provided on standalone drives. Despite their added reliability, such drive arrays can still benefit from an alerting mechanism when one is about to fail.


 Calendar of Events  





Audio-Visual Content and Information Visualization
May 4 – 6, 2005
Cortona, Italy

This DELOS affiliated workshop is targeted to managers of digital audiovisual materials as a forum to disseminate the latest developments and applications in the areas of multimedia content and information visualization in digital libraries.

Digital Preservation Management: Short-Term Solutions to Long-Term Problems
May 15 – 20, 2005
Ithaca, New York

Cornell University Library is pleased to announce continuation of the Digital Preservation Management workshop series. This limited enrollment workshop has a registration fee of $750 per participant. Registration opens March 14 for the May workshop. Additional offerings of the workshop will be held in July and November 2005.

Information Resources Management Association International (IRMA) Conference
May 15 – 18, 2005
San Diego, California

The theme of the 16th IRMA international conference is “Managing Modern Organizations With Information Technology.” The conference encompasses a plethora of tracks including “Electronic Government Research,” “End User and Organization Computing,” and “Text Database and Document Management.”

IT Training for Practicing Archivists Series
May 20 – 21, 2005
Boston, Massachusetts

Presented by the Society of American Archivists (SAA) in two sessions, “Digital Libraries and Digital Archives” and “Digitization of Archival Materials,” these seminars provide an overview of standards, terminology, concepts, workflows, and technology related to digitization and digital archives.

School for Scanning
June 1 – 3, 2005
Boston, Massachusetts

This Northeast Document Conservation Center (NEDCC) conference will span topics from content selection to business management to preservation of paper-based text, photographic images, and analog audio and video objects.

Summer Educational Institute for Visual Resources and Image Management
July 5 – 9, 2005
Durham, North Carolina

Sponsored by the Art Libraries Society of North America (ARLIS/NA) and the Visual Resources Association (VRA) this educational offering targets an audience with a broad range of backgrounds and experience who are interested in managing image collections, particularly related to the transition from analog to digital visual resources.

Joint Conference on Digital Libraries
June 7 – 11, 2005
Denver, Colorado

The Joint Conference on Digital Libraries (JCDL) annual conference theme for 2005 is “Cyberinfrastructure for Research and Education.”

Stayin’ Alive: Long Term Preservation of Digital Files
July 13, 2005
Southborough, MA

NELINET is a cooperative of more than 600 academic, public, and special libraries in the six New England states. This seminar is about the challenges of digital preservation and best practices for maximizing the longevity of digital collections, as well as costs and options for long-term storage and maintenance of digital files.


 Announcements  





The Future Digital Heritage Space
The DigiCULT Consortium has released its seventh thematic issue entitled “The Future of Digital Heritage Space.” Using maritime expedition metaphors, the report presents and discusses the next waves of technological innovations which “...may significantly shape and re-shape the digital landscape in which heritage organisations reside.”

Managing Web Records Guidelines
The National Archives and Records Administration (NARA) has released a new set of guidelines applicable to program staff, webmasters, IT staff, and others responsible for website management and administration.

SHERPA DP: Creating A Persistent Preservation Environment For Institutional Repositories
The Arts and Humanities Data Service (AHDS) has announced SHERPA DP, a two year project funded by JISC (The Joint Information Systems Committee). The project aims to create a preservation service layer based on the OAIS reference model for managing the life cycle of the e-prints archive maintained by SHERPA project partners (SHERPA: Securing a Hybrid Environment for Research Preservation and Access).

Western States Dublin Core Metadata Best Practices Version 2
Institutions in a multi-state initiative to create a virtual collection of distributed digital resources on the topic of Western Trails participated in the revision of the Colorado Digitization Program’s existing General Guidelines for Descriptive Metadata Entry and Creation (1999). The first version debuted in January 2003.

Giza Archives Project website launched
With ongoing funding from The Andrew W. Mellon Foundation, the Museum of Fine Arts, Boston (MFA) has launched an interactive website to provide access to excavation diaries, photographic materials, object register books, maps, plans, and sketches from the ancient tombs and pyramids at Giza in Egypt.

OpenReader
The new OpenReader initiative is developing an open source reading application based on accepted and open standards. Publication types targeted for support include ebooks, periodicals, newspapers, and other page-based documents. The format will be platform-independent, readable on a wide variety of computing devices, and capable of high quality typographic presentation.

Training for Audiovisual Preservation in Europe
TAPE (Training for Audiovisual Preservation in Europe), a 3-year project funded by the EU’s Culture 2000 program, has launched a series of activities for raising awareness about audiovisual preservation, especially in institutions that do not specialize in AV or have dedicated resources for dealing with AV materials. TAPE plans to organize working groups, seminars, and training workshops.

ePrints Soton
e-Prints Soton, the University of Southampton e-Prints Service, will now provide free, internet-based access to its e-Print archive. The repository was established in 2002 to house electronic copies of research output such as journal articles, book chapters, conference papers, and multimedia.

Scientific Data and Information
The International Council on Science (ICSU) has published the report “Scientific Data and Information” based on an assessment of strategic issues of scientific data and information. The report covers the lifecycle of data management highlighting the need for planning for long-term management and preservation.

ERPANET Sustainability Plan
ERPANET has announced the public release of its sustainability report outlining its business plan and new work planned for 2005.

Virtual Data Center Announces Update
The VDC updated its software to version 1.0.2, adding support for Fedora Core 2 and HTTP proxy operations among other enhancements. The VDC is an open source digital library system for virtual collections of quantitative data.

New Digitization Blog
A new blog, digitizationblog, presents a news source and an idea sharing forum for people involved with digitization and related activities in libraries, archives, and museums.

OCLC teams up with Safe Sound Archive
OCLC Content Conversion Services teams up with Safe Sound Archive “to provide digitization services for libraries audio collections, including digital reformatting, archiving and improved access through Open WorldCat.” The first pilot project for the new partnership will be to digitize interviews from Columbia University’s Notable New Yorkers collection.


 RLG News  

Descriptive Metadata Guidelines; The EAD Report Card



Descriptive Metadata Guidelines for RLG Cultural Materials

These new Descriptive Metadata Guidelines help those who are dazed by an increasingly bewildering array of concepts and standards in the field of metadata to make informed choices about describing their collections. Written by a group of experts from a library, archival and museum background, the guidelines find the common ground among the different communities by outlining a strategy which can be implemented using any community-specific suite of standards. The document features introductory sections clarifying concepts and terminology, as well as detailed mark-up examples in its chapter on the core fields and their data values for submission to Cultural Materials. Several appendices profiling standards and specifications round out the offering. The guidelines have been reviewed in the January 2005 issue of CurrentCites, where Roy Tennant states that the document is "chock-full of excellent advice, useful examples, and hard-won metadata wisdom. It should be required reading for anyone working with metadata."

To download the Descriptive Metadata Guidelines for RLG Cultural Materials, please visit RLG's website.

Introducing the EAD Report Card

RLG's EAD Report Card (released January 2005) is the first automated program for checking the quality of your EAD encoding. Created by popular demand, this Web application supplements RLG's award-winning RLG Best Practice Guidelines for Encoded Archival Description. Choose a finding aid and the program will flag any discrepancies and take you to the relevant section of the encoding guidelines.

RLG plans to make the EAD Report Card code available as open source in early 2005. You will be able to download it and run it on your desktop, which should make the checkup even faster. As an open-source tool, it can also be programmed to reflect your local best practices.

RLG commissioned the EAD Report Card as part of our continuing commitment to making archival collections more accessible on the Web. In addition to the guidelines and the report card, RLG also provides access to RLG Archival Resources, a database of archival materials. All institutions are encouraged to submit their finding aids to this database.


 Publishing Information  





RLG DigiNews (ISSN 1093-5371) is a Web-based newsletter conceived by the RLG preservation community and developed to serve a broad readership around the world. It is produced by staff in the Department of Research, Cornell University Library, in consultation with RLG and is published six times a year at www.rlg.org.

Materials in RLG DigiNews are subject to copyright and other proprietary rights. Permission is hereby given to use material found here for research purposes or private study. When citing RLG DigiNews, include the article title and author referenced plus "RLG DigiNews." Any uses other than for research or private study require written permission from RLG and/or the author of the article. To receive this, and prior to using RLG DigiNews contents in any presentations or materials you share with others, please contact Jennifer Hartzell (jlh@notes.rlg.org), RLG Corporate Communications.

Please send comments and questions about this or other issues to the RLG DigiNews editors.

Co-Editors: Anne R. Kenney and Nancy Y. McGovern; Associate Editor: Robin Dale (RLG); Technical Researcher: Richard Entlich; Contributor: Ellie Buckley; Copy Editor: Martha Crowe; Production: Jenn Colt-Demaree, Carla DeMello.


All links in this issue were confirmed accurate as of February 14, 2005.


Copyright 2004 RLG.