Feature Article 1
 Digitizing the World’s Largest Collection of Natural Sounds: Key Factors to Consider when Transferring Analog-Based Audio Materials to Digital Formats
Author: Robert W. Grotke - Cornell University (rwg4@cornell.edu)

Overview
In 1999, through generous awards from the National Science Foundation and the Andrew W. Mellon Foundation, the Macaulay Library of Natural Sounds (MLNS)[1] at Cornell University began the enormous task of converting its analog tape-based sound collection to digital storage. Our collection contains over 160,000 recordings of bird, insect, frog, and mammal vocalizations. Analog formats included acetate disk, cassette, and open reel. A considerable number of the open-reel tapes were in various stages of deterioration and thus required specialized treatment prior to transfer. The project began with three archival studios, but with the help of additional funding from the National Science Foundation and the Office of Naval Research we are currently running six fully equipped studios. This article provides an overview of the critical steps relevant to the digitization process. All photos are courtesy of the Macaulay Library.
Tape Inspection
Prior to beginning the digitization process, a careful review of the materials to be transferred was undertaken. As mentioned above, many of our Mylar-based open-reel tapes had already begun to deteriorate. Some of this deterioration, also known as sticky-shed, is the result of binder breakdown and requires that the tapes be stabilized prior to transferring.
![[Analog files]](file106.gif) Storage facility for analog collection
This process consists of controlled baking (typically 50° C for a period of 24 hours) to temporarily improve binder integrity.[2] Other tapes had splices that needed either repair or replacement. The latter requires a great deal of skill and patience to carefully release the old splice and adhesive without damaging the fragile oxide layer. Still others would require a special relubrication process to minimize friction.[3] The final phase of inspection was to identify the track format. Many of our collection recordings have good, solid data records that accurately describe the make and model of the audio recorder used to create the tape. For those that do not, or are questionable, we use a magnetic developer. This product allows one to literally see the magnetic track format on the tape created by the record head, e.g. full-track, ½-track, ¼-track, 4-track, etc. Once the track format is identified, then the proper head assembly can be selected for an accurate playback.
Playback Equipment Calibration
To extract every nuance of fidelity from the original analog tapes, we took great care to ensure that the master open-reel playback machines (Studer A-820) were in excellent condition. Playback heads and stationary tape guides were visually inspected for excessive wear patterns that could potentially degrade the transfer through increased friction. Tape tensions were adjusted to ensure intimate tape-to-head contact with the least amount of tension, while spooling speeds were set low to provide gentle handling of fragile tapes. Each machine was then calibrated to known international standards using a series of precision calibration tapes from Magnetic Reference Laboratories, Inc. This calibration included adjusting head alignment (height, wrap, and azimuth), playback equalization, playback reference levels, absolute speed, and wow and flutter. Master cassette playback machines (Nakamichi CR-7A) were similarly inspected and calibrated using a series of calibration tapes from BASF prior to beginning the transfer process. The initial calibration/alignment process and all subsequent ones (machines are calibrated biweekly) were accomplished using a computer-based test set manufactured by Audio Precision. The resulting test data were stored and routinely compared to current tests. This allows us to closely monitor the machine’s performance over time and makes it relatively easy to spot problems before they have a negative effect on the transfer process.
Analog-to-Digital Conversion
Perhaps the most important link in the digitization process is the analog-to-digital (A/D) converter. Often overlooked and taken for granted, this sole piece of equipment can make or break the digitization/preservation effort. In our situation we knew that we had many outstanding-quality, wide-spectrum open-reel field recordings. Our ultimate goal was to digitize without compromising quality in any manner whatsoever. To achieve this goal, we first reviewed A/D converters based on published specifications, then narrowed the playing field down to six possibilities and finally requested actual units for in-house testing and audition. The results were nothing short of amazing. Even though all six had very similar published specifications, the actual sound character or lack thereof was very different. Our final decision, the Prism Dream AD-2, was the only device that did not color (alter) our signals. Through grueling A/B listening tests and spectral analysis we confirmed that the digitized signals created by the Prism device were indistinguishable from our highest-quality analog sources.
What makes a good A/D converter? Many key factors determine the quality of a converter system. To make the proper selection, we first needed to make sure that the device could handle the full frequency spectrum of the signals to be digitized. Second, we determined the dynamic range required. For us that meant a converter that used a 96.0kHz sampling rate to cover a signal range from 4Hz through at least 32kHz, and 24-bit output resolution to provide a dynamic range of roughly 128.0dB (unweighted RMS). The Nyquist Theorem states that the sampling rate must be at least twice the highest frequency of interest to achieve lossless sampling.[4]
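As a rough check of the sampling-rate reasoning above, the following Python sketch tests the Nyquist condition for the 96kHz rate and the 32kHz upper frequency quoted in the text. The function names are illustrative only and are not part of any MLNS tool. The second function uses the textbook approximation for an ideal N-bit quantizer (about 6.02dB per bit plus 1.76dB); real converters fall short of that ideal bound, which is consistent with the roughly 128dB quoted for a 24-bit device.

```python
def satisfies_nyquist(sample_rate_hz: float, highest_freq_hz: float) -> bool:
    """True when the sampling rate is at least twice the highest frequency."""
    return sample_rate_hz >= 2.0 * highest_freq_hz


def ideal_dynamic_range_db(bits: int) -> float:
    """Textbook SNR of an ideal N-bit quantizer driven by a full-scale sine."""
    return 6.02 * bits + 1.76


if __name__ == "__main__":
    # 96 kHz sampling can represent content up to 48 kHz.
    print(satisfies_nyquist(96_000, 32_000))
    # CD-rate sampling (44.1 kHz) could not capture a 32 kHz component.
    print(satisfies_nyquist(44_100, 32_000))
    # Ideal upper bound for 24-bit words; practical converters achieve less.
    print(round(ideal_dynamic_range_db(24), 2))
```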
Once these criteria had been defined, we looked for the following other important specifications:
- Total harmonic distortion and noise (1kHz @ -1dBFS): <-108.0dBFS (0.00045%) unweighted RMS
- Intermodulation distortion: <-90dB
- Spurious aharmonic levels: <-130dBFS (1kHz @ -1dBFS)
- Interference susceptibility: >110dB CMRR (50Hz); >85dB CMRR (15kHz)
- Crosstalk: <-130dB (50Hz, -1dBFS in either channel, other terminated); <-140dB (15kHz, -1dBFS in either channel, other terminated)
- Intrinsic conversion jitter: <18ps RMS
- Phase linearity: <1°
- Internal high-precision clock accuracy: ±5ppm
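The distortion figure in the list is given both in decibels and as a percentage. A minimal Python sketch of that conversion follows. It assumes (this is an interpretation, not something the specification sheet states) that the percentage is taken relative to the -1dBFS test tone rather than to full scale; under that reading the result agrees with the 0.00045% quoted.

```python
def db_to_percent(level_db: float) -> float:
    """Convert an amplitude ratio expressed in dB to a percentage."""
    return 100.0 * 10.0 ** (level_db / 20.0)


def thdn_percent(residual_dbfs: float, signal_dbfs: float) -> float:
    """THD+N as a percentage of the test-signal amplitude.

    Assumption: the residual is referenced to the test tone, so the
    ratio in dB is the residual level minus the signal level.
    """
    return db_to_percent(residual_dbfs - signal_dbfs)


if __name__ == "__main__":
    # -108 dBFS residual measured with a -1 dBFS, 1 kHz tone.
    print(f"{thdn_percent(-108.0, -1.0):.5f} %")
```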
Specifications like these are not typically found in the sound cards that often come built into, or bundled with, computers. Nor are they found in stand-alone compact disc recorders. The precision needed to execute exacting A/D conversion requires ultra-clean power supplies; exceptional grounding procedures; ultra-stable clocking devices; high-quality, low-tolerance components; and exceptional printed circuit board design. All of this comes at a high price, but the end results are certainly worth the cost.
Transfer-Level Setting and Monitoring
![[Archival studio]](file51.gif) One of the six archival studios
Another key element in the digitization process is setting the transfer signal level. The MLNS uses Benchmark Media ultra-low noise, low-distortion preamps as the variable-gain stage between the playback devices and the A/D converters. To fully use all the available digital resolution, e.g., 16-bits or 24-bits, it is important to maximize the analog signal level being fed to the A/D converter. At the same time, great care must be taken to never exceed the maximum allowable input, so that the A/D converter is never overdriven during the transfer process. It is far better to adjust the signal levels before the A/D process rather than try to use a “normalize” function once the signal is digitized. The Prism A/D converters provide sample-accurate precision metering that aids in this critical step.
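To illustrate why the level is set in the analog domain, the sketch below measures the peak level of a block of samples in dBFS, reports the unused headroom, and estimates how many bits of resolution that headroom leaves idle (about 6.02dB per bit). It is a generic illustration with hypothetical function names, not a description of the Prism metering; samples are assumed to be floats normalized so that full scale is 1.0.

```python
import math


def peak_dbfs(samples, full_scale=1.0):
    """Peak level of the samples in dB relative to full scale."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return float("-inf")
    return 20.0 * math.log10(peak / full_scale)


def is_clipped(samples, full_scale=1.0):
    """True if any sample reaches or exceeds full scale (converter overdriven)."""
    return any(abs(s) >= full_scale for s in samples)


def unused_bits(samples, full_scale=1.0):
    """Approximate number of most-significant bits left unused by low levels."""
    headroom_db = -peak_dbfs(samples, full_scale)
    return headroom_db / 6.02


if __name__ == "__main__":
    quiet_transfer = [0.05, -0.0625, 0.03]   # peak is 1/16 of full scale
    print(round(peak_dbfs(quiet_transfer), 2))    # about -24 dBFS
    print(round(unused_bits(quiet_transfer), 1))  # about 4 bits idle
    print(is_clipped([0.2, -1.0, 0.4]))
```

A transfer that peaks 24dB below full scale uses only about 20 of the 24 available bits; digital normalization afterward raises the level but cannot restore the resolution that was never captured.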
In addition to the meters, a second form of signal monitoring is accomplished aurally, with high-resolution, near-field monitor speakers. To make this possible, the digital output signals are converted back to analog via high-resolution Prism DA-2 digital-to-analog converters (D/A). Once again, special care was taken to make the D/A selection. To make accurate quality assessments, the D/A must be able to resolve to the same degree of precision as the A/D converter. The monitor speakers are then coupled with a precision switching device allowing the archivists to easily switch the monitors between the feed from the master playback device, the output of the Benchmark preamp, and the output of the D/A converter. Any aural difference is a sign of potential problems in the transfer chain.
Digital Editing System
Once the analog signals have been converted to digital, a digital audio workstation (DAW) is used to edit (if needed), create fade-ins/-outs, add cataloging information (voice ID), add unique file names to each individual recording, and build the DVD project files that will ultimately be written out to DVD-R discs. We use a Sonic Solutions, Sonic Studio HD system. Once again, great care was given to the DAW selection. The Sonic system was selected for its ability to preserve the signal quality from initial recording (digital input) to final digital output by incorporating a full 48-bit data path throughout the system.
Most workstations use 32-bit floating-point math to perform their editing, processing, and gain changes. Thirty-two-bit floating point uses 24 bits of precision (a 23-bit mantissa and one sign bit) and 8 bits of exponent. This exponent is useful for preserving precision over a wide dynamic range but does not actually add to the maximum precision.
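The precision limit of 32-bit floating point can be seen directly. The Python sketch below rounds values to 32-bit floats with the standard `struct` module; the helper name `to_float32` is illustrative. It shows that increments smaller than the 24-bit precision are lost, while the exponent lets the same relative precision apply to very small values.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (64-bit) to the nearest 32-bit float and back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


if __name__ == "__main__":
    # 2**24 is exactly representable, but 2**24 + 1 needs a 25th bit.
    print(to_float32(16_777_216.0) == 16_777_216.0)
    print(to_float32(16_777_217.0) == 16_777_217.0)

    # Near 1.0 the smallest step is 2**-23; half of that step is rounded away.
    print(to_float32(1.0 + 2.0 ** -23) == 1.0 + 2.0 ** -23)
    print(to_float32(1.0 + 2.0 ** -24) == 1.0)

    # The exponent scales the value, so the same relative step survives
    # at a much lower level.
    small = 2.0 ** -20 * (1.0 + 2.0 ** -23)
    print(to_float32(small) == small)
```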
Sonic Studio HD uses 48-bit math in all audio processing. It is set up to provide 40 bits of precision and 8 bits of linear headroom. The 40 bits of precision in Sonic Studio HD offer 16 bits of additional precision compared to 24-bit audio samples. This greatly reduces the round-off error for audibly superior performance.
Digital Media and Data Format Selection
Several factors drove our decisions regarding the selection and formatting of a long-term storage medium. We had already determined early on that our digital collection was going to be optical-disc rather than tape based. We also knew that our storage requirements were going to be huge (roughly 34.6MB/minute of 2-channel audio) due to the high-resolution (96kHz sampling rate, 24-bit word length) digital audio files we would be generating. We examined in depth the three options available at the time: CD-R, DVD-RAM, and DVD-R. CD-R certainly had merit with good player compatibility, excellent life expectancy, and low cost but fell short in terms of capacity. DVD-RAM offered increased capacity and long life expectancy but was expensive and did not offer the archival security of a write-once format. DVD-R, however, offered everything we required. Initially it, too, was relatively expensive, but we knew, based on the CD-R’s history, that as the technology gained momentum, the costs would fall significantly.
![[breakout quote]](file113.gif)
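The storage requirement follows directly from the audio format. A short Python sketch of the arithmetic for uncompressed PCM, ignoring the small file-header overhead (the function name is illustrative):

```python
def pcm_bytes_per_minute(sample_rate_hz: int, bits_per_sample: int, channels: int) -> int:
    """Uncompressed PCM data volume for one minute of audio, in bytes."""
    bytes_per_sample = bits_per_sample // 8
    return sample_rate_hz * bytes_per_sample * channels * 60


if __name__ == "__main__":
    per_minute = pcm_bytes_per_minute(96_000, 24, 2)
    print(per_minute)                      # bytes per minute of stereo audio
    print(round(per_minute / 1e6, 2))      # decimal megabytes
    print(round(per_minute / 2 ** 20, 2))  # binary megabytes (MiB)
```

For 96kHz, 24-bit, 2-channel audio this gives 34,560,000 bytes per minute, which is about 34.6MB in decimal units or about 33MB in binary units.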
The next issue at hand was how to format the data. Our goal was to make the digital collection as generic as possible, thereby maximizing accessibility and setting the stage for easy migration to the next generation of digital storage. Initially we contemplated using the DVD-Audio format (similar to CD-Audio) but soon realized that industry-imposed copy-protection schemes would significantly hamper our accessibility and future migration requirements. Instead, we chose to write each disc as a DVD-ROM using the Universal Disc Format (UDF) standard. Our audio recordings reside on the discs as Audio Interchange File Format (AIFF) data files. Every audio file has a voice ID at the beginning of the selection announcing the asset number, which in our case is the MLNS catalog number. This same number is also used as the digital file name. No other metadata is embedded in the audio file. Due to the common occurrences of species splitting and renaming we chose to store all relevant metadata in a separate relational database.
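The separation between audio files and metadata described above can be sketched with an in-memory SQLite database. Everything in this example is hypothetical: the table layout, the `.aif` extension, the catalog number 12345, and the helper names are invented for illustration and do not describe the actual MLNS database. The point is only that a species rename touches one database row and never the audio file.

```python
import sqlite3


def build_catalog() -> sqlite3.Connection:
    """Create a hypothetical one-table catalog keyed by catalog number."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE recordings ("
        " catalog_number INTEGER PRIMARY KEY,"
        " species TEXT NOT NULL)"
    )
    return conn


def file_name_for(catalog_number: int) -> str:
    """The file name is just the catalog number (extension is an assumption)."""
    return f"{catalog_number}.aif"


def add_recording(conn, catalog_number: int, species: str) -> None:
    conn.execute(
        "INSERT INTO recordings (catalog_number, species) VALUES (?, ?)",
        (catalog_number, species),
    )


def rename_species(conn, catalog_number: int, new_species: str) -> None:
    """Update the metadata only; the audio file on disc is untouched."""
    conn.execute(
        "UPDATE recordings SET species = ? WHERE catalog_number = ?",
        (new_species, catalog_number),
    )


def species_for_file(conn, file_name: str):
    """Look up the current species name from a file name."""
    catalog_number = int(file_name.split(".")[0])
    row = conn.execute(
        "SELECT species FROM recordings WHERE catalog_number = ?",
        (catalog_number,),
    ).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    conn = build_catalog()
    add_recording(conn, 12345, "Plain Titmouse")   # hypothetical record
    rename_species(conn, 12345, "Oak Titmouse")    # taxonomy changed later
    print(file_name_for(12345), species_for_file(conn, "12345.aif"))
```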
Disc-Writing Strategies
Our DVD-R discs are written using Pioneer DVR-S201 recorders. These devices are professional writers that use the DVD-R 4.7GB Authoring version 2.0 media. These media require a 635nm laser wavelength instead of the 650nm laser that the general-use versions utilize; however, they are still fully compatible with all DVD readers. The discs, designed for authoring and replica masters, are generally of a higher quality and more consistent from batch to batch than the general-use discs. We currently use discs from Maxell, TDK, and Pioneer and purchase in lots of 100 to 200 at a time. This purchasing strategy will help minimize any catastrophic batch-related problems.
![[Plasmon D-480 robotic jukebox]](file3699.jpg) DVD jukeboxes
A custom DVD authoring program from Sonic Solutions handles the actual disc formatting, writer control, and bit-for-bit verification. Write speed is approximately 60 minutes per 4.3GB of data. We do not write the disc to the full 4.7GB capacity. Our in-house testing has revealed that the disc quality decreases near the extreme outer diameter of the disc, so we limit our data to 4.3 to 4.4GB per disc.
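Bit-for-bit verification means reading the written data back and comparing every byte with the source. The Python sketch below shows the general idea for a pair of files. It is a generic illustration, not the method used by the Sonic Solutions authoring program; the function name and chunk size are assumptions.

```python
def files_identical(path_a: str, path_b: str, chunk_size: int = 1 << 20) -> bool:
    """Compare two files byte for byte, reading them in chunks."""
    with open(path_a, "rb") as file_a, open(path_b, "rb") as file_b:
        while True:
            chunk_a = file_a.read(chunk_size)
            chunk_b = file_b.read(chunk_size)
            if chunk_a != chunk_b:
                # Covers differing content and differing lengths.
                return False
            if not chunk_a:
                # Both files ended at the same point with no mismatch.
                return True
```

In practice the first path would point at the source file on the workstation and the second at the copy read back from the newly written disc.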
We also create two first-generation discs for our archive. Each disc contains roughly 125 minutes of stereo or 250 minutes of mono material. One disc is placed in a large Plasmon D-480 robotic jukebox for in-house distribution, while the second is stored off site at a secure, climate-controlled, underground storage facility.
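The playing times per disc can be checked against the 4.3GB fill limit. The sketch below assumes decimal gigabytes (1GB = 10^9 bytes) and uncompressed 96kHz, 24-bit PCM with no file-system overhead; the function name is illustrative.

```python
def minutes_per_disc(capacity_bytes: float, channels: int,
                     sample_rate_hz: int = 96_000,
                     bits_per_sample: int = 24) -> float:
    """Minutes of uncompressed PCM audio that fit in the given capacity."""
    bytes_per_minute = sample_rate_hz * (bits_per_sample // 8) * channels * 60
    return capacity_bytes / bytes_per_minute


if __name__ == "__main__":
    usable = 4.3e9  # fill limit used in the text, assumed decimal gigabytes
    print(round(minutes_per_disc(usable, channels=2), 1))  # stereo
    print(round(minutes_per_disc(usable, channels=1), 1))  # mono
```

The results, about 124 minutes of stereo and about 249 minutes of mono, agree with the rounded figures of 125 and 250 minutes.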
Disc-Quality Control
Having many years of hands-on experience testing CD-R technologies, we are well aware of the more-subtle problems associated with CD-Rs caused by writer/disc compatibility, writing-speed issues, and dye-formulation problems. Many of these can have a negative impact on the discs’ playability over time. It is a little-known, but true, fact that not all blank discs perform equally well in all writers. The differences often manifest themselves as significantly higher-than-acceptable error rates or tracking problems. While this may not pose a playability issue immediately, thanks to the error correction/concealment systems employed, there might come a point where the slightest disc degradation could render the disc useless. DVD-R technologies appear to share some of these same issues.
With the above in mind as we grow our digital archive, we do everything in our power to ensure that each and every disc is the very best quality, thereby maximizing its useful life. To reach that goal, every disc created undergoes a series of rigorous tests.
Using an AudioDev Computer Aided Test System (CATS), we first test every blank disc. During this test phase the disc is subjected to 20 different tests that measure such values as disc reflectivity, push-pull, wobble signal-to-noise, land pre-pit level, block error rate (BLER), etc. A successful test will certify the blank disc’s ability to perform to specification during the writing phase. The ability to test blank, unwritten discs is a valuable asset. Not only do we save time and money by not writing on known defective discs, but we also save valuable hours on the expensive writing lasers.
The next and final step in the Q/C process is the disc verification after writing. The primary goal of this test phase is to quantify the quality of the writing process. Over 50 important parameters are tested during this phase, including servo and tracking, jitter analysis, digital errors, dropouts, HF parameters, and physical measurements. The results of this process offer a pass/fail report detailing all tests and their respective values. The CATS also provides surface-analysis testing to help reveal defects due to disc flatness, focus error, radial noise, and other anomalies that can occur during the manufacturing process. Failed discs are carefully scrutinized and either retested or rejected.
Archive Monitoring
All of the discs’ Q/C data are stored electronically. Discs are randomly pulled from the local jukeboxes and retested. Current test data are compared with prior test results to monitor the integrity of the digital medium. Any disc degradation or manufacturing-batch-related problems can be readily identified and digital clones created on new stock.
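The comparison of current and prior test results can be sketched as a simple trend check. In the Python example below, the disc identifiers, the use of a single error-rate value per disc, and the 25% threshold are all hypothetical choices made for illustration; they are not taken from the CATS reports or from MLNS practice.

```python
def flag_degraded(previous: dict, current: dict, max_increase: float = 0.25) -> list:
    """Return IDs of discs whose error rate rose by more than max_increase.

    previous and current map a disc ID to a measured error rate.
    max_increase is a fractional threshold (0.25 means 25 percent).
    """
    flagged = []
    for disc_id, new_rate in current.items():
        old_rate = previous.get(disc_id)
        if old_rate is None:
            continue  # no earlier measurement to compare against
        if new_rate > old_rate * (1.0 + max_increase):
            flagged.append(disc_id)
    return sorted(flagged)


if __name__ == "__main__":
    baseline = {"D0001": 10.0, "D0002": 12.0}
    retest = {"D0001": 10.5, "D0002": 20.0, "D0003": 5.0}
    # Discs listed here would be candidates for cloning onto new stock.
    print(flag_degraded(baseline, retest))
```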
Accessibility/Distribution
We consider the DVD-R discs our high-resolution “core-archive.” These hi-res versions are available only in-house via a high-speed network. For external distribution via the Internet we create a variety of down-sampled versions. These include a CD-Audio quality 44.1kHz/16-bit wave file, a 96kbps MP3 file, a multi-bitrate RealAudio streaming file, and, coming soon, a QuickTime streaming file. All these files reside locally on a 25-terabyte Apple Xserve RAID storage system. A full backup of these files is maintained off site using an Exabyte LTO tape library system.
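Creating the CD-quality version involves two changes: the sampling rate drops from 96kHz to 44.1kHz, and the word length drops from 24 to 16 bits. The Python sketch below shows only the arithmetic behind those two steps. It is not a production converter: a real sample-rate converter also applies a low-pass filter, and word-length reduction normally adds dither, both of which are omitted here. The function names are illustrative.

```python
from fractions import Fraction


def resample_ratio(source_rate_hz: int, target_rate_hz: int) -> Fraction:
    """Exact ratio of output samples to input samples, in lowest terms."""
    return Fraction(target_rate_hz, source_rate_hz)


def requantize_24_to_16(sample_24bit: int) -> int:
    """Reduce a signed 24-bit sample to signed 16-bit by rounding.

    Adds half of the discarded step before dropping the low 8 bits,
    then clamps to the 16-bit range. No dither is applied.
    """
    rounded = (sample_24bit + 128) >> 8
    return max(-32768, min(32767, rounded))


if __name__ == "__main__":
    # 147 output samples for every 320 input samples.
    print(resample_ratio(96_000, 44_100))
    print(requantize_24_to_16(8_388_607))    # largest 24-bit value
    print(requantize_24_to_16(-8_388_608))   # smallest 24-bit value
```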
Future Proofing
Only time will tell, but if history is any indication, we assume that in the not-too-distant future some new and better digital format for long-term preservation will appear in the market place. Unfortunately, as technology changes, it typically renders existing hardware and software obsolete. In the meantime we will monitor new digital storage technologies while continuing to grow what we believe to be a very robust and accessible digital storage solution. When an improved, standardized digital format does appear, we feel confident that we have set the stage to migrate data from our current storage strategy to the next generation in a relatively painless and automated fashion.
Notes
[1] The MLNS Web site is currently being completely reworked. Once set up, it will be easily found from a link on the Laboratory of Ornithology’s main site.
[2] Van Bogart, John W.C., “Magnetic Tape Storage and Handling,” National Media Laboratory, June 1995.
[3] Stosich, Michael N., “Problems and Solutions in Long-Term Tape Performance,” Audio, November 1990.
[4] Pohlmann, Ken C., “Principles of Digital Audio,” 2d ed., Howard W. Sams & Company, 1989.
Editor's Interview
 Digital Preservation Coalition
Author: Maggie Jones - Digital Preservation Coalition (DPC) (info@dpconline.org)

Editors' Note This piece continues our series of Editors' Interviews that allow us to get current information on relevant programs, projects, and initiatives. The Digital Preservation Coalition (DPC) is an innovative organization in the UK that resulted from the efforts of a group of determined professionals to coordinate and focus the digital preservation activities at both the national and international level. Maggie Jones is the Coordinator and Company Secretary of the DPC and has provided the following answers to questions we posed.

Could you provide some background on the DPC and its work?
Here's a bit of history. The concept of a coordinating mechanism was first proposed at a workshop held at the University of Warwick in 1999. During a review of progress since the first Warwick Workshop, held in 1995, one recommendation from the 1999 Workshop was to establish a coordinating body with representation from different sectors to promote awareness of digital preservation issues. It also proposed the establishment of a new position in the Joint Information Systems Committee (JISC) to focus on digital preservation, as it was becoming increasingly clear that this could not be done in addition to other responsibilities. The JISC Digital Preservation Focus position was subsequently established, and Neil Beagrie took up that post in June 2000, with an early priority of setting up the organization recommended at the Warwick workshop. The DPC was formed following a summit in January 2001 with seven founding members and was subsequently incorporated as a company limited by guarantee in July 2002.
Your Web site defines goals and objectives for getting started. Two years and a bit down the road, would you review your accomplishments to date, your current and near-term priorities, next steps?
We've just had our first Annual General Meeting of the DPC, celebrating its first year of existence as a company. This provided an opportunity to review what has been achieved in the first year and also to look forward. The first Annual Report will be available from the DPC Web site in the near future. It indicates a strong advocacy and awareness-raising element. The DPC hired a PR consultant to help get digital preservation into the media and generally spread the word. This has been enormously successful, with twenty-five articles on digital preservation appearing in the national media during the DPC's first year of operation. DPC Forums have also proved to be a very effective means of disseminating information about what's happening, keeping DPC members and other interested parties well informed, and also providing opportunities for networking.
As we move forward, we want to build on this solid foundation but move away from general awareness raising to a more-targeted approach. For example, one focus for January-June 2004 will be drawing attention to the newly created Digital Preservation Award to recognize leadership and achievement in digital preservation. It is sponsored by the DPC and included as part of the Pilgrim Trust Conservation Awards. There are five worthy contenders shortlisted, and the winner will be announced at a ceremony at the British Library in June.
We also want to engage the attention of those who are in a position to fund the development of digital preservation infrastructure. One of the overarching objectives in the workplan is to “get digital preservation on the agenda of key stakeholders in terms that they will understand and find persuasive.” The UK Needs Assessment, which we have just embarked on, will be a major means of achieving this over the next twelve to eighteen months. The first stage of the assessment was to survey all DPC members. This has yielded a wealth of information that we're still mining and that will be used to assist in presenting the case for increased investment in digital preservation.
The survey also strongly reinforced the need for training in digital preservation. This will be an important component of future work. As a first stage, we're hoping for JISC funding of a report, which will investigate options and costs of developing an in-depth training program.
There is a members-only section on your Web site. What kind of content does that section contain?
We keep our administrative stuff there: records of DPC Board meetings and so on. We also have a section of members' documents, policies that DPC members have developed and have been prepared to share with others who might use them as models for their own organizations. A Technology Watch section contains references to relevant documents on standards and formats, and we recently commissioned three reports to add to this section. The intention of all of them is to present quite complex concepts and developments without recourse to unnecessary or confusing jargon. One is a joint DPC/OCLC report on OAIS, authored by Brian Lavoie, that will appear in the Members section of the site very soon. OAIS is a great example of a development that is clearly significant for digital preservation infrastructure, but the standard can be a fairly intimidating document to read without the help of a user-friendly guide. This is what we hope the Technology Watch reports will provide. We'll also be looking at developing the Members section of the Web site further in the coming months.
Would you talk about membership (categories—institutions, individual, UK-only etc., benefits, numbers, expected/desired growth)? Have there been any unexpected developments in your membership (examples: institutions you didn't expect to join, kinds of institutions you expected to get but didn't)? What kind of roles do members play?
There are three categories of membership at the moment: full members, associate members, and allied organizations. We're just embarking on a review of our membership structure, as it's timely to revisit the categories and to clarify the goals and benefits of membership. We expect the review to be completed in time for membership renewals in July.
The DPC operates both as an entity with a set of core activities and a work plan, and also through the individual activities of our members. This is crucially important, because our members include those organizations that have already taken on a leadership role in tackling digital preservation challenges, and we need to exploit that effort as much as possible. The recently conducted DPC survey revealed a wealth of information on member activity, and we're currently preparing a list of digital preservation projects being undertaken by DPC members.
You refer to stakeholders throughout your organizational documents. Who would you identify as your stakeholders? Have they changed since you began your work? What are some examples of strategic partnerships that you are (or anticipate) forging?
Our primary stakeholders are our members, and we try to keep their differing needs and requirements firmly at the forefront of our activities. In addition, aside from the potential funders of digital preservation referred to above, we've been keen to ensure that we have formal relationships with similar organizations, for example, the National Preservation Office of the UK and Ireland, with whom we have a memorandum of understanding. Securing the preservation of digital resources in the UK and working with others internationally is another major goal for the DPC, and we've been determined to build and maintain contacts with organizations who are putting effort into digital preservation activities overseas. So we have an MOU with the National Library of Australia, and we're also developing a similar document with the Library of Congress's NDIIPP program.
We're also keen to forge a partnership with Cornell. We were enormously impressed by their online tutorial and workshop and would like to develop that model here. We've had some informal discussions to date, but it would be great to develop that further. We see the development of an intensive training program as requiring significant preparation and planning, so it won't necessarily achieve a rapid result, but we're convinced the end result will be well worth the wait.
The DPC is also represented on a number of task forces and working groups as a means of keeping in touch with colleagues working overseas and contributing to the overall progress toward common goals. The PREMIS Working Group and RLG/NARA Task Force on Certification of Digital Repositories are two examples of this effort.
The next DPC Forum, which will be held on June 23, 2004, at the British Library, is focussing on international digital preservation developments. It will be a great opportunity to hear about significant developments and consider ways we can most effectively work together.
Yes, it really seems to have tapped into a need, and we're very keen to develop the online version further. We've used the print Handbook as the basis for a series of training workshops we've done for DPC members. We'd like to see the online version developed further, and it might be possible to do this as part of our longer-term plans for more-intensive training programs.
What is the relationship between the DPC and JISC?
I think a lot of people are very confused by this, not least because JISC played such a pivotal role in establishing the DPC, and because of that, the registered office for the company is still JISC for the moment. So I think a lot of people think the DPC is actually part of JISC, which is not the case. The DPC is a separate legal entity from any of its members, and includes twenty-six other members as of January 2004, in addition to JISC. The DPC is also cross-sectoral and has been deliberately so from the start, whereas JISC is of course confined to the UK Higher Education and Further Education sectors.
Another key development that is worth mentioning here is the forthcoming National Digital Curation Centre. This new initiative is being jointly funded by JISC and the e-Science Core Programme for an initial three-year period. The work of the NDCC will be aimed at the specific needs of UK Higher Education and Further Education sectors and will address issues of pressing concern by undertaking research, developing tools to support effective digital curation, developing standards and certification, and piloting services. At the time of writing, it was still to be officially announced, but it will be established in the very near future. This is another organization the DPC will expect to work very closely with, and we're looking forward very much to that.
Recently Neil, a founder and key player in the DPC, moved to a new liaison position in the British Library to strengthen the ties between that organization and JISC and work on new opportunities for collaboration. What implications does his new position have for the DPC? What kinds of initiatives might occur as a result?
I think this is an incredibly positive move for all concerned. The DPC can only benefit from having Neil in a liaison position between two major members of the DPC, and I expect to still be working very closely with him.
It's a little premature to predict any specific initiatives yet—Neil started in the role only in January! Neil's role in the BL/JISC position is wider than digital preservation, which is even better, I think, especially as the lines between digital preservation and other digital library activities increasingly blur. So having someone in that new role who knows and understands the working of the DPC as well as Neil does will be far likelier to open up opportunities for collaboration with the DPC.
Would you describe the current and future funding (known and potential) that enables DPC's work?
The DPC has been entirely dependent on membership fees and, in particular, fees from our full members. This has provided necessary foundation funding and a build-up of reserves. It would also not have been established without the support of JISC, which has not only been a full member of the DPC from the beginning, but has also provided essential staffing support that enabled the DPC to make substantial progress until a permanent staffing base could be secured.
As the DPC program of work becomes more ambitious, we will also be seeking additional funds to help us achieve specific objectives. For example, Resource, the Council for Museums, Libraries, and Archives, is helping us bid for funding from the New Opportunities Fund for a survey of smaller regional organizations, many of whom will have been recipients of NOF digitization funding. This is an essential component of the UK Needs Assessment exercise we're undertaking—I'll come back to that later. We're also hoping to secure funding from JISC to prepare a report on training requirements and options.
What does DPC staffing look like at this point?
You're talking to the DPC staffing! The new post of DPC coordinator was established in May 2003, and I took on that role, working very closely with Neil, who maintained his role as DPC company secretary until November 2003, when I felt able to take that on. We employ a webmaster on a consultancy basis and make good use of other consultancies to take our work forward. For example, the recently conducted survey of DPC members, which formed the first phase of our UK Needs Assessment exercise, was undertaken by Duncan Simpson; and Michael Day, of UKOLN, collaborates with the NLA, on behalf of the DPC, to produce the quarterly issue of What's New in Digital Preservation? I've already mentioned the Technology Watch reports and the PR consultancy. We're trying to make the most effective, flexible use of our funding to get the best value from it, so we need to be a very lean organization.
What would you say have been the most effective enablers of your program? What barriers have there been to achieving your goals and objectives? Have there been any surprises in either category? Is there anything else of note that you would like to share about the DPC—past, present, and future?
The most effective enablers of our program have been the DPC members, in particular JISC, as I've indicated above, and also the DPC Board, who have been incredibly supportive and helped to guide the direction of the organization. The barriers have mainly to do with resources—there's a lot to do and a lot we need to do. But I've been grateful that there was already a work plan to guide me in selecting the most urgent priorities. I've made a New Year's resolution not to get stressed by what isn't being done, but to concentrate on what is achievable. As an Australian colleague was fond of saying, Concentrate on the doughnut, not on the hole!
Feature Article 3
 The Bits and Bites of Data Formats—Stainless Design for Digital Endurance
Author: Andreas Aschenbrenner - ERPANET (aschenbrenner@student.ifs.tuwien.ac.at)

Editors' Note: XML has been at the center of numerous digital preservation discussions for the past several years. This piece raises some caveats and concerns regarding the practical use of XML for preservation purposes. In future issues, we will continue to provide additional takes on XML and other potential digital preservation solutions.
When archiving digital information over the long term, we are confronted with the rapid pace of technology. Software formats vanish into obsolescence even faster than tangible elements do. In addition to their irritatingly short expiration date, the variety of formats makes reusability of information and system independence a faint hope.
Many people point to standard formats as a solution that will facilitate deciphering objects into the future. Undoubtedly standards may alleviate some problems in enabling digital preservation. Individual data formats, however, are designed for specific reasons. Conversion from one format to another may entail loss of some of these features. Loss of information is the most-imminent risk, though not the only one.
After discussing some of the raisons d'être of data formats from a technical perspective, this article highlights restrictions in translating between formats, specifically in the example of XML, [2] and investigates implications for digital preservation.
In pointing out the idiosyncrasies and the use of specific data formats, this text concludes with a positive supposition: data formats are not an obstruction to digital preservation in themselves. No format is more equal than the others, but some formats are more appropriate in some environments and for some requirements than others.
Design Criteria for Data Formats
Specific design goals guide the definition of a data format. Software dependency is a side effect of proprietary formats that may, as some speculate, be welcomed by some profit-oriented vendors, but there are more-objective design criteria and motivators that explain the myriad of existing formats, as suggested by the list below: [3]
1. The main reason for defining a data format is to store information.
2. Taking a closer look at this, the format is often a container for information at different levels. Besides the actual content, the format may also store information that controls specific functionality of a software application. [4]
3. Another design criterion may be the size of the resulting data object. While storage space is getting cheaper by the day, the volume of digital objects grows excessively as well. At the same time, the larger the object, the longer its transmission takes via a network. So saving storage remains a consideration.
4. Implications for the performance of a system may need to be considered. A data format may be required to facilitate efficient access to and manipulation of information. For that purpose, a format may be geared towards a specific application.
5. In some instances, extensibility and generality of a data format may be highly desirable.
6. Other formats may influence the design of a new data format with the aim of allowing for compatibility. Although backward or forward compatibility, for example, is unlikely to be sustained over long periods of time and a number of technology generations, it may play a role in the definition of a data format.
7. Specific measures could be taken to ensure the integrity of information. These include incorporating design features supporting robustness against data loss and impeding unauthorized tampering of the information. Also, security or confidentiality concerns could drive design decisions.
This list of design criteria is not exhaustive. It is, however, sufficient to explain the current multiplicity of data formats. At the same time, it explains their rapid evolution, which goes hand in hand with the progressively changing requirements that are the basis of these criteria. Even standard formats will not survive for eternity.
Note that there is tension between some of the above criteria. For example, a data format that is compressed in size (3) may be slow to encode (4). Some redundancy in formats, which is unfavorable for their size (3), may be conducive to their integrity (7). There are, however, various ways to reach one's ends. Instead of requiring inherent redundancy, the robustness of a format may also be enhanced by more-explicit measures, for example, by the use of cyclic redundancy checks.
Some of the criteria are inherent in the format, while others may be external or supplementary. For example, cyclic redundancy checks could be part of the file format or stored externally in the system environment in which the object is embedded. Similarly, information about the functionality of a specific object could be stored partly in the object itself and partly externally. Likewise, compression algorithms can be used to manipulate the size of an object after its creation.
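As a rough illustration of the cyclic redundancy check mentioned above, the following Python sketch computes a CRC-32 checksum with the standard `zlib` module and uses it to detect a damaged copy. The sample bytes and the helper names `crc32_of` and `is_intact` are assumptions made for this example; where the checksum is kept (inside the file or in external metadata) is left open, as in the text.

```python
import zlib


def crc32_of(data: bytes) -> int:
    """Return the CRC-32 checksum of a byte string as an unsigned integer."""
    return zlib.crc32(data) & 0xFFFFFFFF


def is_intact(data: bytes, expected: int) -> bool:
    """Compare the checksum of the data with a previously stored checksum."""
    return crc32_of(data) == expected


# Hypothetical object content; any byte string would do.
original = b"10,5,0,12,7,1"

# The checksum could be stored in the file itself or in the archive's metadata.
stored_checksum = crc32_of(original)

# Simulated damage: a single byte has changed in storage.
damaged = b"10,5,0,12,9,1"

print("original intact:", is_intact(original, stored_checksum))
print("damaged intact: ", is_intact(damaged, stored_checksum))
```

A checksum of this kind only detects damage. Repairing it still requires redundancy, such as a second copy of the object.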
This discussion demonstrates that a small number of standard formats can hardly embrace all requirements of individual software applications. Moreover, these requirements may change during the existence of a digital object; its required features are not necessarily the same in active use as when finally archived. Although a standard format may be too restricted for use in active applications, its features may satisfy preservation requirements.
XML is often and for good reasons promoted as a standard format for preservation purposes. However, an XML-based preservation format is inappropriate for some data types and in some circumstances, as the following section will highlight based upon a review of the above criteria.
Putting XML to the Test
The advantages of XML have been exhaustively discussed [5] and are undeniable. In particular, the features of XML that foster human readability and system independence are invaluable. Some loose ends remain, however.
A word on syntax, structure, and semantics
XML defines a surface syntax for structured documents—their notation and basic structural rules. For constraining the structure of XML documents the World Wide Web Consortium (W3C) developed the language XML Schema [6]. An XML Schema can be associated with an XML document to specify which structural elements may occur at what point. As such, the Schema definition may be compared to the actual definition of a data format with an XML syntax. In other words, XML on its own is insufficient to serve as a complete data format.
Given the need to impose structure on XML, there is the risk of a variety of XML Schemas being defined, each for a slightly different use. That is happening at the moment; current initiatives tend to embark on developing Schemas on their own and only for their own use. The flood of XML Schemas, however, encumbers digital preservation as well as reusability and interoperability just as the myriad of other data formats does.
Moreover, both languages in tandem, XML and XML Schema, are insufficient for expressing the semantics—i.e. the meaning—of a digital object. Other languages from the toolkit of the W3C Semantic Web Activity are further building blocks for facilitating machine understanding and automation on a basic semantic level. But eventually exhaustive documentation of a data format is necessary to make it human-understandable and to preserve its meaning.
Considerations for size and performance
In the design of an XML Schema, the above criteria need to be considered as for any other data format. The application of XML alone does not ensure a desirable and reliable format. This is even more true for the criteria of size and performance. The XML syntax makes it difficult, if not impossible, to define formats that are small in size and/or facilitate performance.
Difficulties in converting formats to XML
Perhaps most important, a number of data types cannot reasonably be translated to XML. Take, for example, an image format. It is, of course, possible to mark up an image in XML:
<image><pixel><position><x>1</x><y>1</y></position><color><red>10</red><green>5</green><blue>0</blue></color></pixel><pixel>... </image>
In another data format, the same might be expressed as
10,5,0,...
A more-appropriate solution would be to select one of the widely available standard formats for images such as TIFF, PNG, or JPEG2000. But different image formats serve different purposes, too.
There are other kinds of data that have not been considered for translation or simply cannot be translated into XML, including audio, video, and 3D simulation models. Similarly, large repositories of scientific data may deliberately choose not to adhere to XML.
Considering human readability
When reviewing XML's acclaimed feature of human readability, we find that it is not inherent in the XML format. Of course, XML leverages human readability. However, its elements must be named such that they are also human-understandable.
To underline this argument, we take the example above. We have seen that the version of the image marked up in XML is quite a bit longer. Let us try to improve on that:
<i><p1><p2 x="1" y="1" /><c r="10" g="5" b="0" /></p1><p1> ... </i>
So it is possible to produce slightly more compact XML code: this statement is half as long as the initial XML example above. Although this is beneficial for the criterion of size, it encroaches on human readability. It is now not obvious that <p1> stands for a pixel and <p2> for the position of this pixel. To reiterate, XML does not automatically mean human-readable. There may, in fact, be non-XML data formats that are more easily understandable, perhaps with the assistance of a brief external explanation. The importance of documentation is therefore emphasized in this context. The wide gap between human-readable and human-understandable needs to be bridged by suitable documentation.
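The trade-off between size and readability can be checked with a few lines of Python. The sketch below encodes a single pixel in the three notations discussed above and compares their sizes. The article's examples are elided after the first pixel, so closing each document after one pixel is an assumption made here to keep the XML well-formed.

```python
import xml.etree.ElementTree as ET

# One pixel at position (1, 1) with colour (10, 5, 0) in three notations.
# Assumption: each document is closed after a single pixel.
verbose = (
    "<image><pixel><position><x>1</x><y>1</y></position>"
    "<color><red>10</red><green>5</green><blue>0</blue></color>"
    "</pixel></image>"
)
terse = '<i><p1><p2 x="1" y="1"/><c r="10" g="5" b="0"/></p1></i>'
plain = "10,5,0"

# Both XML variants are well-formed: the parser accepts either one.
verbose_root = ET.fromstring(verbose)
terse_root = ET.fromstring(terse)

sizes = {
    "verbose": len(verbose.encode("utf-8")),
    "terse": len(terse.encode("utf-8")),
    "plain": len(plain.encode("utf-8")),
}

# Without documentation, the element names are all a reader has to go on.
verbose_tags = [element.tag for element in verbose_root.iter()]
terse_tags = [element.tag for element in terse_root.iter()]

for name, size in sizes.items():
    print(f"{name:8s} {size:4d} bytes")
print(verbose_tags)
print(terse_tags)
```

Both XML variants parse equally well, so well-formedness says nothing about how understandable the names are. That part has to come from documentation.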
Reusability and interoperability
Like human readability, reusability of digital objects and interoperability are not inherent properties of XML either. Again, it depends on whether two partners in an interaction adhere to a common XML Schema. Deciphering an XML file demands even more than an XML Schema: the meaning of marked-up information must be understandable, which goes beyond mere structural definition. As a particular manifestation of this requirement, a proprietary file format wrapped in XML remains proprietary. Generally speaking, just because a file is XML-based doesn't mean it will be open. [7]
Final thoughts
In summary, XML is not a one-size-fits-all solution. In the end, the human designer will determine if its advantageous features can be exploited. In some situations and for some data types, it may prove better to use either a standard format or possibly a dedicated data format instead of one based on XML.
Many digital objects, however, will be converted to an XML-based format for preservation, for example text-processing documents, which represent a huge percentage of the overall mass of digital objects. Text-processing software employing XML formats, such as OpenOffice, is particularly interesting for preservation initiatives in this context. These XML formats will have to be further evaluated to determine their human-readability. These formats may prove viable for preservation purposes because of the availability of authoritative technical specifications and their capacity to conserve significant properties without including large quantities of unnecessary elements.
Moreover, the possibilities offered by the XML family of specifications are only now being explored. XML in combination with RDF and other standards being developed by the W3C offers powerful possibilities. Employed with thoughtfulness and diligence, XML-based formats, like other data formats, will be an important component of a preservation solution.
Data Compression in Digital Preservation
Compression algorithms were developed to produce smaller files from existing data formats. From this perspective, the compression of a data file is simply the translation of one data format into another. Preserving compressed objects is consequently a manageable challenge. The same measures that are taken for any other data format ensure the accessibility of compressed information. Obviously, compression algorithms may become obsolete just like any other technology. The conversion from one algorithm to a new one thus needs to be completed before it's too late.
Nonetheless, applying compression algorithms may be less risky than converting from one data format to another. As the name implies, compression algorithms can be expressed in mathematical terms. This reduces the risk of inadvertently losing information in the compression or decompression process. Moreover, the compression tool can be immediately tested to see whether it correctly implements the algorithm. With suitable technical documentation, the digital objects can be decompressed now and in the future.
Usually the inversion of the compression algorithm yields the original data. Some algorithms are lossy [8] in the sense that they reduce the quality of the original in the process of compression. This may not be acceptable when digital objects need to be preserved, and lossless compression algorithms should be chosen. [9]
In a nutshell, compression algorithms do not obstruct digital preservation if the algorithm is carefully selected, preservation methods such as migration are applied with care, and exhaustive documentation is retained.
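The difference between lossless and lossy compression described above can be tested directly, as the text suggests. The following Python sketch uses the standard `zlib` module (DEFLATE) for the lossless case. For the lossy case it uses a simple quantization of sample values, which is a stand-in chosen for this example and not a real image or audio codec. The sample data is hypothetical.

```python
import zlib

# Lossless: decompression returns exactly the bytes that went in.
original = b"10,5,0," * 1000            # hypothetical, highly repetitive data
compressed = zlib.compress(original, 9)  # 9 = highest compression level
restored = zlib.decompress(compressed)

print("original bytes:  ", len(original))
print("compressed bytes:", len(compressed))
print("round trip exact:", restored == original)

# Lossy (illustrative stand-in): quantizing 8-bit samples to multiples of 16
# discards detail that no later step can bring back.
samples = bytes(range(256))
quantized = bytes((value // 16) * 16 for value in samples)
distinct_after = len(set(quantized))

print("distinct values before:", len(set(samples)))
print("distinct values after: ", distinct_after)
```

A round-trip comparison like the first half of the sketch is one way to verify that a compression tool behaves losslessly on a given collection before relying on it.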
Encryption in Digital Preservation
Considerations for managing data formats apply equally to the encryption of digital objects as they do for compression. Similarly, preserving encrypted digital objects is not an unmanageable challenge provided the necessary precautions, described above, are taken. The importance of documentation and metadata cannot be emphasized enough at this point, as one missing piece in the jigsaw puzzle may prevent access to an object in the future. A central component when preserving encrypted objects is the key to decrypt them again. The key must be reliably preserved in a secure place.
Conclusion
Despite the obvious advantages of standard formats for digital preservation, as well as reusability of data and interoperability of systems, they have to be applied with due consideration. Standard formats may in some situations fall short of specific requirements. In the design or choice of a data format, the above criteria need to be taken into account.
Moreover, as part of the above criteria, a preservation format must adequately preserve the intellectual content of a digital object. The elements included in an object's intellectual content may be defined in its significant properties, a term coined by the digital preservation project Cedars. Here it is important that significant properties be defined for individual objects in specific environments. Generic significant properties defined for a data format fail to address the preservation requirements of each and every organization. For instance, the formatting of a report may be integral to one organization, while, for another, the retention of a plain-text transcription suffices. Or, as another example, in some contexts the specific functionality of software that reflects in the data format may be important, which is considered extraneous in other environments using the same software. In these cases, the preservation formats of the two organizations will differ even though they both started from the same active data format. More than that, different processes in the same organization may raise different preservation requirements. This may indeed lead to varying preservation formats for the same active data format in the same organization.
All this calls for a more-careful selection of data formats, comprehensive documentation of them, and active management of the digital objects throughout their existence. To attain interoperability, all stakeholders have to present their requirements, and subsequently a suitable format can be chosen or designed in a cooperative effort. This may sometimes be a painful process, and it is unlikely that there is any format that satisfies all variations of preservation needs for a specific data type. Installations such as registries that are being developed for metadata [10] and for file formats [11] may offer the possibility of sticking to local variations in formats while at the same time allowing interoperability on a more global level.
On the whole, data formats can be useful tools that even support a specific preservation solution. Bear in mind, however, that just as a screwdriver should not be used to drive in a nail, a careful selection of tools is paramount. [12]
Notes
[1] The author is the Dutch content editor for ERPANET.
[2] As will become clear in the following, the XML language alone is not a data format, but, together with other members of the XML family, it is a tool for defining one.
[3] This article assumes that stakeholders are interested in interoperability, promote openness, and strive to work cooperatively towards preserving their information for future generations.
[4] Take an instruction in a text-processing format, for instance, that prompts the document to open at a specific size: the content of the text does not change if this information is missing; or an image format that has the capability to store image data in different layers: the layers are not recognizable when the image is viewed. Software needs dedicated information within the data format, however, to provide such functionality.
[5] One of numerous initiatives discussing the advantages of XML in preservation is the Digital Preservation Testbed in its white paper XML and Digital Preservation (September 2002).
[6] Formerly, document type definitions (DTDs) were used. They are currently being superseded, however, by XML schemas (.xsd). Refer to the W3C Web site for more information.
[7] The IDA (Interchange of Data between Administrations) Open Source Migration Guidelines. Guidelines funded by the European Commission. (October 2003), p. 24.
[8] While lossy compression may be possible in areas other than image compression, it is not sensible for text compression, for instance. For more-detailed information, refer to Wikipedia.
[9] Even if a loss of quality may appear acceptable in the present, future use may demand the original quality. For preserving digital objects, lossy compression should therefore be applied only after careful deliberation.
[10] Michael Day. Integrating Metadata Schema Registries with Digital Preservation Systems to Support Interoperability: A Proposal. In: Proceedings of the 2003 Dublin Core Conference: Supporting Communities of Discourse and Practice – Metadata Research and Applications, Seattle, Wa., USA, 28 September-2 October 2003.
[11] Stephen L. Abrams and David Seaman. Towards a Global Digital Format Registry. World Library and Information Congress: 69th IFLA General Conference and Council, August 1-9, 2003, Berlin, Germany.
[12] Other technical format specifications, including protocols and interface definitions, are similar tools, and the issues in this text apply to them analogously.
Highlighted Web Site
 The Apache Cocoon Project

The open-source Cocoon Project began as a way for the Apache Software Foundation, creators of the Apache Web server, to separate a constantly changing piece of text from the unchanging piece of HTML that displayed it on the Web. This dilemma is common, since Web content, style, and logic are often created by different individuals or working groups and often have different needs. Cocoon aims to completely separate these three layers, allowing Web documents to be independently designed, created, and managed.
To do this, Web information is first stored in an XML document and is then transformed on the fly by the Cocoon server engine into a variety of formats in which it might be needed. Cocoon components allow you to start with a single XML file and a handful of XSLT templates, and end up with a variety of information formats, including HTML, RSS, PDF, FOP, and WML. Since XML is based on Unicode, it can also be used to quickly encode pages into a variety of languages. This ability to create derivatives on the fly from XML-encoded masters has considerable relevance for preservation. Other, more-general benefits include a reduction in page-maintenance overhead and an increased ability to reuse code.
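The idea of deriving several delivery formats from one XML master can be sketched without Cocoon itself. The Python example below is not Cocoon's API and does not use XSLT: two plain functions stand in for stylesheets, and the document structure (`article`, `title`, `para`) is a hypothetical one invented for this illustration.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical XML master; the element names are assumptions for this sketch.
MASTER = (
    "<article>"
    "<title>Sample page</title>"
    "<para>Content &amp; style are kept apart.</para>"
    "</article>"
)


def to_html(root: ET.Element) -> str:
    """Render the master as a minimal HTML page."""
    title = escape(root.findtext("title", default=""))
    paragraphs = "".join(
        f"<p>{escape(para.text or '')}</p>" for para in root.findall("para")
    )
    return f"<html><body><h1>{title}</h1>{paragraphs}</body></html>"


def to_text(root: ET.Element) -> str:
    """Render the same master as plain text, one block per line."""
    lines = [root.findtext("title", default="")]
    lines += [para.text or "" for para in root.findall("para")]
    return "\n".join(lines)


root = ET.fromstring(MASTER)
print(to_html(root))
print(to_text(root))
```

The master is never edited to produce a derivative. Only the rendering functions differ, which is the property that makes this approach interesting for preservation.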
View Cocoon's sample installation
FAQ
 Handwriting Recognition for Historical Documents
Author: Richard Entlich

OCR (Optical Character Recognition) seems to be widely used for providing searchable indexes of printed texts that have been scanned. Is it possible to do a similar thing with handwritten manuscripts and correspondence?
OCR Background
OCR is used to generate machine-readable text from printed documents. These are generally legacy documents from before the electronic publishing era, but may also be printed documents for which the original machine-readable text was discarded or lost.
OCR of printed text is a well-developed technology that has steadily improved in accuracy and flexibility. Initially limited to interpretation of numerals printed with special fonts, current-day OCR software can deal with a multitude of fonts, character sets, languages, and page attributes. For extremely clean and well-scanned documents, the resulting text may be good enough to use for direct display purposes. More commonly, the OCR is somewhat "dirty" (i.e., contains errors) but is still accurate enough to form the basis of a quite usable machine-searchable index. Accuracy rates of 99.5% and higher (at the character level) are achievable for good-quality source documents.
Handwriting Recognition Background
The conversion of handwriting to machine-readable text is usually referred to as handwriting recognition (HR). Computer scientists recognize two distinct classes of handwriting recognition. The better known of these is on-line HR, a real-time process usually employing a special stylus and pressure sensitive tablet that allows the direction and order of the writer's strokes to be monitored while writing. First popularized by the Apple Newton MessagePad, on-line HR is now available on most PDAs (Personal Digital Assistants).
The process of converting an existing handwritten document into machine-readable text is called off-line HR and is more closely analogous to OCR. Off-line HR is a far more daunting computing task and, as a result, is not as mature a technology as either OCR or on-line HR. The reasons are not hard to fathom.
Unlike printed text (that is, machine produced type), handwriting is subject to almost infinite variation. Cursive writing, in particular, can easily defeat human attempts to interpret it, as anyone who has attempted to decipher a doctor's handwriting can attest. Machine interpretation relies on reducing the scanned image to some kind of recognizable pattern. Patterns may be missed because of vague word boundaries, overlapping letters, and great variations in the slant, spacing and shape of letters. Such variations may be modest within the writings of a single author, but are tremendously magnified across multiple authors. Further hampering recognition, handwritten documents tend to be "noisier" than printed ones due to smudging, staining, stray marks, underlining, and cross-outs.
Thus, early work in off-line HR, like that in OCR, focused on small, simple character sets such as numerals. Even today, much research and development is focused on highly constrained tasks such as reading cities, states and zip codes on hand-addressed mail, interpreting the dollar amount line on bank checks, or deciphering business forms, such as tax returns.
Methods for Off-line Handwriting Recognition of Historical Documents
Figure 1. A portion of a scanned page from the Library of Congress's George Washington manuscripts. Rectangles have been drawn around where the words would be segmented. Also, dark lines which result from the scanning process have been removed from the sides. Note that the segmentation process is not perfect. "Winchester" in the fifth line and "Nicholas" in the next to last line have been divided into two parts.[1]
A small but steady stream of computer scientists has been trying to tackle the difficult task of deciphering cursive handwriting. The desire to improve access to large collections of important historical manuscripts has motivated most of this work. Scanned versions of the papers of Isaac Newton, U.S. presidents (especially George Washington), and the Archives of the Indies in Seville, Spain, to name a few, have served as recent experimental fodder.
In most cases, the objective of these experiments is less ambitious than full machine translation of handwriting. Instead, the goal is usually to recognize a subset of the most commonly used vocabulary (anywhere from a few hundred to one or two thousand words), usually within the writings of a single author. That vocabulary then serves as an index to support text queries. Limitations on vocabulary and authorship are intended to simplify the computational task so it can be done in a reasonable period of time, at an acceptable cost, and with a usable degree of accuracy.
Here are descriptions of a few of the different techniques being investigated:
Character segmentation attempts to identify individual characters and build them into words. This is exceedingly difficult to do with any degree of accuracy.
Word segmentation attempts to detect word boundaries, often supplemented by other document cleaning and filtering operations such as artifact removal, normalization of slant, smoothing, and binarization (converting grayscale images to bitonal). An effort can then be made to recognize the pattern made by an entire word and convert it to machine readable form without trying to identify individual characters.
Figure 2. Processing and normalization steps on a segmented word image prior to image matching. Panels: original grayscale image; binarized with artifacts removed; with slant correction.[2]
Word spotting is a form of off-line HR using word segmentation. In word spotting, the segmented words are first normalized to minimize variation and then similar images, which hopefully represent the same word, are clustered together. These groupings are called equivalence classes. No machine interpretation is done, only image matching. The groups of matched words are then displayed to a human operator who provides the text equivalent. Figure 3 shows a simplified diagram of the word spotting process, though stop words like "the" and "that" would normally be discarded. A subset of the most frequently occurring remaining words is used to create an index of the document.
Word spotting has also been applied in multiple author environments where word segmentation is not feasible, using different image matching techniques.
Figure 3. A conceptual diagram of the word spotting technique for indexing of matched word images.[3]
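The grouping step of word spotting can be sketched in a few lines of Python. This is a deliberately simplified stand-in and not the matching technique used by the researchers: each word image is assumed to have been reduced already to a short, equal-length feature profile, and a greedy threshold rule forms the equivalence classes. The profiles, the threshold, and the operator's labels are all hypothetical.

```python
from typing import Dict, List, Sequence


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference between two equal-length feature profiles."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cluster(profiles: Sequence[Sequence[float]], threshold: float) -> List[List[int]]:
    """Greedy grouping: a profile joins the first class whose first member
    lies within the threshold; otherwise it starts a new class."""
    classes: List[List[int]] = []
    for index, profile in enumerate(profiles):
        for members in classes:
            if distance(profiles[members[0]], profile) <= threshold:
                members.append(index)
                break
        else:
            classes.append([index])
    return classes


# Hypothetical normalized profiles, one per segmented word image.
profiles = [
    [0.1, 0.8, 0.7, 0.2],  # word image 0
    [0.9, 0.1, 0.2, 0.9],  # word image 1
    [0.2, 0.7, 0.8, 0.2],  # word image 2, resembles image 0
    [0.8, 0.2, 0.1, 0.9],  # word image 3, resembles image 1
]

classes = cluster(profiles, threshold=0.15)

# A human operator supplies the text for each class (labels are invented here).
operator_labels = ["winchester", "orders"]
index: Dict[str, List[int]] = dict(zip(operator_labels, classes))

print(classes)
print(index)
```

No machine interpretation takes place in this sketch either: the program only groups look-alike images, and the text equivalent comes from the operator.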
Statistical methods built on word segmentation are also being explored. Within a set of documents by a single author, a training subset is word segmented and manually transcribed. The images of the words are described using a highly formalized language based on the features (size, sequence of hills and valleys, etc.) of the particular image. The statistical correlation of the transcribed words with their feature-based descriptions is recorded for the entire set of training documents. Subsequently, a textual query can be made against a set of documents from the same collection that have been word segmented and feature described, but not transcribed. The query returns a set of word images (within a single line of the original document) most likely to match the query terms.
Transcript mapping is a technique used when a transcript of a handwritten document has been created, but it is unknown how the transcript corresponds to the location of words (pages, lines, and line position) in the original document. The existence of a transcript defines the vocabulary of the document, leaving the still non-trivial task of determining precisely where those words occurred.
Commentary
The amount of research activity and the variety of clever techniques being utilized in off-line HR should be gratifying for the archivists who maintain, and the scholars who utilize, handwritten historical documents. However, it should be noted that none of the work described here appears ready to emerge from the laboratory anytime soon.
Unconstrained machine translation of handwriting appears particularly far off, and may be unachievable. Even a less ambitious goal, such as software to reliably create partial indexes from good quality single author material, is unlikely to be met within the next several years.
However, enough progress seems to have been made for librarians, archivists, and scholars to become more involved in the ongoing research. Until now, there appears to have been little participation by those parties other than to provide sample documents and, on occasion, to serve on advisory boards.
For librarians and archivists, the future potential for machine translation should at least be considered when handwritten historical documents are digitized, particularly large collections by authors with legible handwriting. Since documents deemed worthy of digitization are likely to be of greater than usual significance, they are also good candidates for transcription and/or indexing. Accurate off-line HR depends on scans with minimal noise and artifacts, so some additional effort to create very clean scans may be merited.
For those documents that are deemed so significant that it is worth fully transcribing them manually, the transcripts should record page, line, and word position to facilitate the potential to create indexes that can pinpoint the search term's location in the scanned document. (This presumes the document can be word-segmented, so the nature of the author's handwriting is again a consideration.)
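A transcript that records positions lends itself to a simple inverted index. The Python sketch below assumes a hypothetical record layout of (word, page, line, word position). The sample entries and the stop-word list are invented for illustration and are not taken from any real transcript.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class WordPosition(NamedTuple):
    """Location of one transcribed word in the scanned document."""
    page: int
    line: int
    position: int


# Hypothetical transcript records: the word plus where it occurs.
transcript: List[Tuple[str, WordPosition]] = [
    ("The", WordPosition(1, 5, 1)),
    ("Winchester", WordPosition(1, 5, 2)),
    ("orders", WordPosition(1, 5, 3)),
    ("Winchester", WordPosition(3, 1, 7)),
]

STOP_WORDS = frozenset({"the", "that"})


def build_index(
    records: Iterable[Tuple[str, WordPosition]]
) -> Dict[str, List[WordPosition]]:
    """Map each non-stop word (lowercased) to every position where it occurs."""
    index: Dict[str, List[WordPosition]] = defaultdict(list)
    for word, where in records:
        key = word.lower()
        if key in STOP_WORDS:
            continue
        index[key].append(where)
    return dict(index)


index = build_index(transcript)
print(index["winchester"])
```

With positions stored at this granularity, a search hit can be shown as a highlighted region on the page image instead of only a page number.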
Archivists could also advise computer scientists about how best to produce indexes that would interoperate smoothly with existing machine-readable finding aid standards such as EAD (Encoded Archival Description).
From the computer science side, more consultation with archivists and librarians familiar with the scanning of historical documents could avoid certain costly mistakes. For example, some of the researchers spent time cleaning up highly compressed JPEG files that suffered from severe artifacting around the text, instead of starting out with uncompressed or losslessly compressed TIFFs.
Others have worked with low-resolution grayscale images that they have binarized using static thresholding techniques (that is, a single threshold value was used to binarize an entire page or collection of pages). Historical documents are usually scanned at 8-bit grayscale because they tend to be too tonally rich for satisfactory bitonal capture. However, some of the computer scientists seemed unaware of the availability of scanning software capable of dynamic thresholding and automatic background detection and suppression. Such software can produce bitonal scans with uniform contrast and text legibility even from originals with stains, fading, and uneven ink density.
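The difference between the two approaches can be sketched in a few lines. This is an illustration only: local-mean thresholding is just one simple form of dynamic thresholding, and it is not claimed to be what any particular scanning package does. The function names, the `radius` and `offset` parameters, and the pixel values are all assumptions made for the example.

```python
def binarize_static(image, threshold=128):
    """Mark a pixel as ink (1) if it is darker than one fixed value."""
    return [[1 if px < threshold else 0 for px in row] for row in image]


def binarize_local_mean(image, radius=2, offset=30):
    """Mark a pixel as ink (1) if it is darker than the mean of its
    neighbourhood by more than `offset`."""
    height, width = len(image), len(image[0])
    result = []
    for y in range(height):
        out_row = []
        for x in range(width):
            # Collect the pixels in a square window, clipped at the edges.
            window = [
                image[yy][xx]
                for yy in range(max(0, y - radius), min(height, y + radius + 1))
                for xx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            local_mean = sum(window) / len(window)
            out_row.append(1 if image[y][x] < local_mean - offset else 0)
        result.append(out_row)
    return result


# Invented 8-bit grayscale strip: clean paper (220) on the left, a stain
# (120) on the right, and one vertical ink stroke in each half (60 and 30).
row = [220, 220, 60, 220, 220, 120, 120, 30, 120, 120]
page = [row[:] for _ in range(5)]

print(binarize_static(page)[0])
print(binarize_local_mean(page)[0])
```

With the single threshold, the whole stained area falls below 128 and turns black, swallowing the stroke inside it. The local comparison keeps both strokes and drops the stain, because each pixel is judged against its own surroundings rather than against the page as a whole.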
In the meantime, if it is discovered that certain scanning practices would substantially improve the prospect for usable HR of historical documents, those standards should be promulgated to libraries and archives for consideration.
Finally, it is unclear what role scholars are playing in the development of systems for HR on their behalf. Though most of the details of crafting a successful off-line HR system fall within computer science and closely related realms, there are certain questions that only the end users of historical documents are in a good position to answer.
What corpora of historical documents would benefit most from being made searchable? If a search vocabulary has to be whittled down in size in order to reduce the computational load, which terms should be given priority for retention? Should the most common terms be kept, or should personal names, place names, or dates be preferred? What degree of inaccuracy can be tolerated before an index loses its value?
Conclusion
There is as yet no commercial or open source software for automatic transcription of, or the creation of searchable indexes from, handwritten historical documents. However, it is an active area of research and progress is being made. Continued advancement depends on the availability of funding. Librarians, archivists, and scholars may be able to push the agenda more effectively by partnering with computer scientists who share an interest in solving this challenging problem and improving access to significant historical archives.
Further Reading
Note: Much of the literature of off-line HR is highly technical. Some of the following papers provide a general overview of the subject, while others are best read for their abstracts, introductions, and conclusions (unless, of course, hidden Markov models and affine transforms are your cup of tea). All documents are PDFs.
Kane, Shaun, Andrew Lehman and Elizabeth Partridge, "Indexing George Washington's Handwritten Manuscripts: A Study of Word Matching Techniques," Technical Report of the Center for Intelligent Information Retrieval, University of Massachusetts, 2001.
Keaton, Patricia, Hayit Greenspan and Rodney Goodman, "Keyword Spotting for Cursive Document Retrieval," Proceedings of the IEEE Workshop on Document Image Analysis (DIA '97), in conjunction with the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '97), June 1997, San Juan, Puerto Rico, pp. 74-81.
Koerich, A. L., R. Sabourin, C. Y. Suen, "Large Vocabulary Off-Line Handwriting Recognition: A Survey," Pattern Analysis and Applications, v. 6, no. 2, pp. 97-121, July 2003.
Manmatha, R., "Word Spotting: Indexing Handwritten Manuscripts," DLI2/IMLS/NSDL Principal Investigators Meeting, Portland, Oregon, July 17-18, 2002.
Plamondon, Rejean and Sargur N. Srihari, "On-Line and Off-Line Handwriting Recognition: A Comprehensive Survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, v. 22, no. 1, January 2000.
Rath, Toni M., Victor Lavrenko and R. Manmatha, "A Statistical Approach to Retrieving Historical Manuscript Images without Recognition," Technical Report of the Center for Intelligent Information Retrieval, University of Massachusetts, 2003.
Tomai, Catalin I., Bin Zhang and Venu Govindaraju, "Transcript mapping for Historic Handwritten Document Images," Proceedings of the Eighth International Workshop on Frontiers in Handwriting Recognition (IWFHR'02), pp. 413-418, September, 2002.
Verma, B., M. Blumenstein and S. Kulkarni, "Recent Achievements In Off-Line Handwriting Recognition Systems," International Conference on Computational Intelligence and Multimedia Applications (ICCIMA '98), Melbourne, Australia, pp. 27-33, 1998.
Notes
[1] Image courtesy of R. Manmatha, Center for Intelligent Information Retrieval, University of Massachusetts.
[2] Originally published in Rath, T.M., S. Kane, A. Lehman, E. Partridge and R. Manmatha, "Indexing for a Digital Library of George Washington's Manuscripts: A Study of Word Matching Techniques," Technical Report of the Center for Intelligent Information Retrieval, University of Massachusetts, 2002. Used with permission.
[3] Adapted from Manmatha, R., "Word Spotting: Indexing Handwritten Manuscripts," DLI2/IMLS/NSDL Principal Investigators Meeting, Portland, Oregon, July 17-18, 2002. Used with permission.
Calendar of Events
Basics and Beyond: Digitization Fundamentals Online Course. February 23-March 12, April 12-30, and August 2-20. The Illinois Digitization Institute at the University of Illinois Library at Urbana-Champaign is offering a course entitled Digitization Fundamentals. The course will be directed towards participants from libraries, museums, archives, and other institutions who are seeking in-depth digitization instruction to work with cultural heritage materials.
Digital Preservation Management: Short-Term Solutions to Long-Term Problems. Registration opens March 1 for the May 10-14, 2004 workshop, Cornell University Library, Ithaca, NY. Cornell University Library will offer its next digital preservation management workshop during the week of May 10-14, 2004. This limited-enrollment workshop has a registration fee of $750 per participant. A second workshop is scheduled for July 19-23, 2004 (registration will open this summer), and a third will be held November 1-5, 2004.
DLKC'04: International Symposium on Digital Libraries and Knowledge Communities in Networked Information Society. March 2-5, Ibaraki, Japan. This symposium will discuss the current status and future prospects of digital libraries and knowledge communities in our networked information society.
Visual Resources Association Conference. March 9, Portland, Oregon. The annual conference of the Visual Resources Association includes a seminar on issues in the preservation and conservation of born-digital art, as well as topics on cataloging, sharing, and managing visual resources in a library and museum context.
Digital Resources and International Information Exchange: East-West. March 12-19. This series includes sessions on the following topics: International cooperation in modern information environment, Library associations and globalization of information environment, and Role of public libraries in cultural heritage preservation in digital environment.
Slide Libraries and the Digital Future. March 24, London, UK. The event is aimed at slide librarians and those responsible for visual collections. The day will address issues and inform practitioners about resources and good practice in the use of digital images in the UK.
Planning, Fund Raising and Tendering for Digitization Projects. April 6, London, UK. This workshop is aimed mainly at managers in museums, libraries, local history units, and county archives. Topics include Why digitize, Planning and fund raising, Tendering for services (plus in-house versus outsourcing), and Skills, training, and recruitment.
The Role of Audit and Certification in Digital Preservation. April 14-16, Antwerpen, Belgium. This three-day ERPANET workshop, co-hosted by the Stadsarchief Antwerpen, will explore the purposes of audit and certification and examine their use within digital preservation. Issues will include the implementation of audit frameworks and standards, corporate roles and responsibilities, and experiences of audit (financial management, government, and information systems).
Archival Perspectives in Digital Preservation: Society of American Archivists. April 15-16, New York, NY. How do you make the connection between fundamental archival principles and the idea of “digital preservation” as it has evolved since 1996? Drawing on a growing technical literature defining digital preservation requirements, the seminar explores how concepts such as integrity, authenticity, and trust are embedded in specific digital preservation development programs, including the work of OCLC/RLG, InterPARES, and selected European initiatives.
Web Tools for Librarians. May 11-14, Madison, WI. The School of Library & Information Studies at the University of Wisconsin will present a series of programs for librarians facing the challenge of delivering library services via the Web. Workshop topics include XML, Open source software, Web content management, and Using the Web to communicate.
School for Scanning: Building Good Digital Collections. June 2-4, Chicago, IL. Presented by the Northeast Document Conservation Center, this conference provides current information for managers of paper-based collections (including photographs) who are seeking to create, manage, and preserve digital assets.
Joint ARLIS/NA and VRA Summer Educational Institute for Visual Resources and Image Management. July 7-10, Durham, NC. The educational institute is intended to provide a standardized and sustainable program for visual resources training, with a focus on issues related to the transition from analog to digital collections.
RLG News
Continuing RLG Forum Series - To Have and To Hold: Metadata and Institutional Repositories
Following on the well-received To Have and To Hold: Metadata and Institutional Repositories forums held at the Library of Congress and the Chicago Historical Society in December 2003, RLG has scheduled two more events in this series. This one-day forum covers two interrelated topics pertinent to members and non-members alike: metadata and institutional digital repositories. Featuring RLG member experts and other speakers local to the venue, the forum serves as an educational opportunity for those desiring to learn more about how peer institutions are addressing the challenges related to long-term access to and preservation of digital materials.
On 7 April 2004, a forum will be held in Stanford, California. Hosted by the Hoover Institution at Stanford University, the forum has been timed to precede the METS Opening Day event at Stanford University. This educational opportunity is open to all RLG member staff and to those attending METS Opening Day. Speakers and a full agenda will be announced shortly. The program will be similar to that of the December events. For more information, contact Fran.Devlin@notes.rlg.org.
On 18 May 2004, this forum will travel to Europe and be held in Den Haag, The Netherlands. Hosted by the Nationaal Archief, the final forum in this series will feature expert speakers from our member institutions based in Europe. Speakers and a full agenda will be announced shortly. For more information, contact Fran.Devlin@notes.rlg.org.