October 15, 2003, Volume 7, Number 5
ISSN 1093-5371
PRONOM—A Practical Online Compendium of File Formats

Jeffrey Darlington
The correct interpretation of records has always required knowledge of the language in which they were written, and sometimes of other subjects too (medieval penmanship, for example). Fortunately, enough of this knowledge has survived that we can make sense of most of the records that have come down to us. Modern technology has further complicated the problem of interpretation by making the viewing of records dependent on hardware and software environments whose own longevity is doubtful. Just as interpretation of the 1086 Domesday Book depends on the dictionaries and grammars for medieval Latin painstakingly compiled by long-dead scholars, interpretation of contemporary electronic records in the future will only be possible if the necessary methods and tools are compiled, documented, and preserved now.

When, in the evolution of computing, punched cards were superseded by magnetic tape, computer records became invisible and readable only by computers. Preservation programs were started up as early as 1962, and it was soon recognized that the incompatibility of tape formats would complicate the task of preservation. This realization led to the development of the ASCII character code, and other standards for media and recording formats that have served the preservation community well over the years.

The new problems generated by the evolution of the personal computer and the word processing application were not so readily recognized. The new tools were seen at first as merely a method of creating paper documents that could be archived on paper. File formats were not standardized and soon proliferated alarmingly, every new software product having its own format.

Dimensions of Incompatibility

Besides the incompatibility between products, there is the time dimension. For each product, new versions of its file formats quickly superseded earlier ones. As more and more facilities were added, the formats became more complex.
Sometimes the old format was a subset of the new, giving forward compatibility, but often the appearance of an old record was not rendered correctly by new versions of the software. These problems have multiplied as records have evolved from simple texts to complex assemblies of diverse elements, including embedded formulae, images, and charts. Web sites go a stage further with animations, video clips, and dynamic content. There is no paper analog that could be archived in these cases.

And yet another dimension is that of dependencies. There are layers of application software that depend on operating systems that in turn depend on hardware. Anyone responsible for managing and sustaining access to electronic records, even over relatively short timescales, must deal with these challenges to ensure that valuable records are not left stranded in formats that are no longer supported.

One approach to a solution is to migrate records from obsolete formats into current ones, ideally into formats with published standards. Another is to preserve records in their original formats and keep copies of software products that can interpret those formats. To preserve the ability to run those products, copies of operating systems must also be kept. And in principle, old models of hardware must be either preserved (the museum approach) or emulated in software. The emulation approach is described by Rothenberg.

Technical information about file formats and the software products that support them is a prerequisite for any digital preservation regime.

Introducing PRONOM

The first version of PRONOM was developed in March 2002 in parallel with the development of our Digital Archive system. It was designed to hold reliable technical information about the nature of the electronic records to be stored in the archive.
For example, for Microsoft Word 97, PRONOM will tell you when it was launched, by whom, whether it is still on the market and whether it is still supported, what formats it writes, and what formats it reads.

Interest expressed by other national archives led to the concept of distributing the database on CD, and PRONOM 2 was released in December 2002 to provide support for multilingual versions of the system. The system was designed from the outset with a Web-based user interface and XML system interfaces, to conform to the UK e-Government Interoperability Framework.

The latest version, PRONOM 3, is now being launched on the Web to make it available to the whole preservation community. We have simplified the user interface in the light of our own usability testing. The main search page allows the user to look up a file extension and see all the formats PRONOM recognizes with that extension, some extensions being shared by a number of products. It also allows the user to search for potential migration paths for a given format.

Content Development

We also began to build up our library of software products. This was an easier task, and it turned up a useful new source of information. The boxes in which distribution CDs are packed often display information about operating system dependencies and compatibility with other products that is not published elsewhere.

Since that time, considerable effort has been devoted to the collection of PRONOM content. National Archives staff have undertaken intensive research and liaison with major software developers in order to create an initial core data set of software product information. Microsoft has been particularly helpful. For the initial data load the focus is on the most commonly used office products for PC operating systems from PC-DOS onwards. We intend to load information on about 450 products over the next few months.
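The two searches offered by the PRONOM 3 search page (formats by file extension, and potential migration paths for a given format) can be pictured with a small data model. The following is only an illustrative sketch: the class names, field names, function names, and sample records are all hypothetical and are not taken from the PRONOM database or its XML interfaces. It assumes that a migration path exists wherever a product reads the source format and writes a different one.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Format:
    """A file format record (hypothetical fields)."""
    name: str
    extension: str  # stored lower-case, without the leading dot


@dataclass
class Product:
    """A software product and the formats it reads and writes."""
    name: str
    reads: set = field(default_factory=set)
    writes: set = field(default_factory=set)


def formats_for_extension(formats, extension):
    """Return every known format that uses the given file extension."""
    ext = extension.lower().lstrip(".")
    return [f for f in formats if f.extension == ext]


def migration_paths(products, source):
    """Return (product name, target format) pairs for a source format.

    Assumption: a product offers a path if it can read the source
    format and can write some other format.
    """
    paths = []
    for product in products:
        if source in product.reads:
            targets = sorted(product.writes - {source}, key=lambda f: f.name)
            paths.extend((product.name, target) for target in targets)
    return paths


# Invented sample data; none of these names describe real products.
writer_v1 = Format("Example Writer 1 document", "doc")
editor_v3 = Format("Example Editor 3 document", "doc")
plain_text = Format("Example plain text", "txt")
all_formats = [writer_v1, editor_v3, plain_text]

editor = Product(
    "Example Editor 3",
    reads={writer_v1, editor_v3},
    writes={editor_v3, plain_text},
)

shared = formats_for_extension(all_formats, ".DOC")
paths = migration_paths([editor], writer_v1)
```

Here `shared` contains both invented `.doc` formats, which mirrors the point that one extension may be shared by several products, and `paths` lists the two formats into which the hypothetical editor could migrate an "Example Writer 1" document.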
The content development work is ongoing, and we have at least some information on over 3,000 file formats yet to be verified. We encourage software developers and others to be proactive in providing information - there is an online submission form on the PRONOM Web site.

Preservation Strategies

We need a measure of how much the information content of a record is altered as a result of a particular migration process. Information content includes formatting, and functionality that is integral to the record rather than to the creating software (the difference between a hyperlink in a Word document and the fact that Word includes a spell check function). The design of PRONOM allows us to keep a measure of the “content invariance” of a migration path. Our intention is to define an objective and rigorous methodology for testing migration paths to measure content invariance, and to record this within PRONOM for each format that a particular product can read.

Migration paths for WordStar data are complicated because early DOS versions of WordStar used seven-bit ASCII characters, the eighth bit being used as a line wrap marker. When viewed by later products, these characters are wrongly interpreted as eight-bit ASCII equivalents, and to achieve a successful migration it is necessary to strip out the marker bits from the WordStar files. Since line wrapping is handled differently in later products, the loss of the eighth bit normally makes no difference, and at worst causes the text to be adjusted to different margins. This example shows the part that detailed technical knowledge plays in implementing a workable migration strategy, and also the importance of keeping the original bit-streams.

Technologies are constantly evolving, and archivists must be aware of these changes and the implications for the electronic records they are preserving.
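The marker-stripping step in the WordStar example can be sketched in a few lines. This is a minimal illustration and not the National Archives' migration tool: the function names and the sample bytes are hypothetical, and the sketch assumes only what is stated above, namely that the eighth bit of a byte is a marker and the lower seven bits carry the ASCII character. It does not attempt to handle any other WordStar-specific formatting codes.

```python
def has_marker_bits(data: bytes) -> bool:
    """Report whether any byte has its eighth (most significant) bit set."""
    return any(byte & 0x80 for byte in data)


def strip_marker_bits(data: bytes) -> bytes:
    """Return a copy of the data with the eighth bit of every byte cleared.

    The input is left untouched, so the original bit-stream can still
    be preserved alongside the migrated copy.
    """
    return bytes(byte & 0x7F for byte in data)


# Invented sample: the final "d" of "word" carries the marker bit.
original = b"wor" + bytes([ord("d") | 0x80]) + b" wrap"
migrated = strip_marker_bits(original)
```

Because `strip_marker_bits` builds a new `bytes` object, the original stream remains available for archiving, in line with the point about keeping the original bit-streams.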
As old software products cease to be supported and become obsolete, preservation activity will be needed for the file formats that depend on those products, whether by content-invariant migration or by preserving the software. A “technology watch” process to identify triggers for preservation actions is a component of the PRONOM program.

There would be many advantages in migrating records to an XML-based standard format. The development of practical tools for this task depends on detailed technical descriptions of how each format actually works. This information is beyond the present scope of PRONOM, but we plan to collect it for the next major release. A subset of this information would be useful to develop tools for the recognition of file formats, a function included in our Digital Archive and at present provided by commercial viewer software. We expect to include this recognition function in a later version of PRONOM.

The Web-enabled PRONOM 3, completed this month, marks the latest stage in the evolution of the system. At the same time, it is a starting point for the development of PRONOM as a major shared online resource for the international digital preservation community. The National Archives has plans for major enhancements over the next few years, including the development of a number of specific tools to support digital preservation activities. Further information about our future plans is available on the National Archives Web site. We hope that through our continuing research and contributions from the Web community, the content of PRONOM will expand to give a comprehensive coverage of file formats that will support worldwide preservation initiatives.

Acknowledgments
Publishing Information

RLG DigiNews (ISSN 1093-5371) is a Web-based newsletter conceived by the RLG preservation community and developed to serve a broad readership around the world. It is produced by staff in the Department of Research, Cornell University Library, in consultation with RLG and is published six times a year at www.rlg.org.

Materials in RLG DigiNews are subject to copyright and other proprietary rights. Permission is hereby given to use material found here for research purposes or private study. When citing RLG DigiNews, include the article title and author referenced plus "RLG DigiNews." Any uses other than for research or private study require written permission from RLG and/or the author of the article. To receive this, and prior to using RLG DigiNews contents in any presentations or materials you share with others, please contact Jennifer Hartzell, RLG Corporate Communications.

Please send comments and questions about this or other issues to the RLG DigiNews editors.

Co-Editors: Anne R. Kenney and Nancy Y. McGovern; Associate Editor: Robin Dale (RLG); Technical Researcher: Richard Entlich; Contributor: Erica Olsen; Copy Editor: Martha Crowe; Production Coordinator: Carla DeMello; Assistant: Valerie Jacoski.

All links in this issue were confirmed accurate as of October 15, 2003.