February 15, 2003, Volume 7, Number 1 | ISSN 1093-5371
Squeezing more life out of bitonal files: a study of black and white
Part I

Your editor's interview in the December 2002 RLG DigiNews states that JPEG 2000 can save space and replace the multitude of file formats being used for conversion and display of cultural heritage images, but that it isn't suitable for bitonal material. We have lots of bitonal images. Is there anything similar available for them?

Editors’ Note: There are significant parallels in recent developments for color and bitonal images. In both cases, many of the existing preservation master files held by institutions are TIFFs (Tagged Image File Format), which are almost always converted to some other format for access purposes (generally JPEGs or GIFs). In both cases, new file formats and compression schemes offer the potential for reduced storage space, improved transmission, and greater functionality for users. And in both cases, decisions about whether to take advantage of new developments, and if so, when and how, are difficult and complex.

Bitonal Scanning Background

A bitonal image epitomizes the notion of a bitmap. Each dot or pixel captured by the scanner is mapped to a single bit, which can take on the binary value of one or zero. Though those two values could conceivably correspond to any pair of display or output colors, traditionally they are represented by black and white.

Early experiments with scanning of cultural heritage materials (from around 1990) focused on methods for reformatting brittle books from the 19th and early 20th century. Bitonal scanning was a natural choice, because most of the materials being scanned consisted primarily of text, which is suitable for bitonal capture. However, technical and budgetary considerations also contributed heavily to the choice. Scanning for preservation reformatting requires that the digitized image be a high-fidelity surrogate for the original document.
Achieving such an outcome requires high-resolution scanning and the storage of resulting images with either no compression or lossless compression (compression that allows exact restoration of the original bitmap).

Consider a book page approximately 6" wide and 9" high consisting primarily of ordinary text. In 1990, a 600 dpi scan of that page could be losslessly compressed to about 100 KB. A comparable 24-bit color image could be scanned at lower resolution, but would losslessly compress only to about 5 MB. In 1990, the highest capacity magnetic drives commonly available held 1-2 GB and cost about $2/MB. It was simply cost-prohibitive to consider gray scale or color scanning of large collections. Additionally, high-speed networks were only just starting to be deployed, and few home users had greater than 2400 bits per second access to the Internet. There was no practical way to move such large images around. Thus, despite its limitations, in 1990 bitonal scanning was the only affordable, technically feasible way to scan large collections.

Since the early 1990s, cultural heritage institutions have bitonally scanned millions of pages of brittle monographs, mostly using the TIFF file format and Group 4 fax compression (now known as ITU-T T.6). Hereafter we refer to these files as "TIFF G4s." Bitonal scanning has also been applied to other kinds of printed matter with less demanding requirements.

As the Web became widely available in the mid-1990s, institutions started making their bitonal images Web accessible. Even then it wasn't feasible to ship users the 600 dpi 1-bit image, owing to the lack of display hardware (i.e., high-resolution monitors), CPU power, and display software to handle such fare. Additionally, bandwidth limitations and the fact that TIFF is not a native Web format dictated that the images be served up one at a time, scaled in size, and converted to a Web-native image format, such as GIF.

Fast Forward to the Present

A lot has changed since 1990.
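To see how tightly the 1990 constraints bound, the page-size, storage-cost, and modem figures quoted above can be worked through in a short sketch. This is illustrative arithmetic only, under assumptions that are not in the article: the color scan is taken to be 300 dpi (the text says only "lower resolution"), 1 KB is taken as 1024 bytes, and the modem is assumed to carry data with no protocol overhead. The function and variable names are hypothetical.

```python
# Illustrative arithmetic for the 1990 figures quoted in the text.
# Assumptions (NOT from the article): the color scan is at 300 dpi,
# 1 KB = 1024 bytes and 1 MB = 1024 KB, and the modem has no overhead.

KB = 1024
MB = 1024 * KB
GB = 1024 * MB


def raw_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed bitmap size in bytes for a page scanned at `dpi`."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


def storage_cost(size_bytes, dollars_per_mb):
    """Cost of storing one file at a given price per megabyte."""
    return size_bytes / MB * dollars_per_mb


def transfer_seconds(size_bytes, bits_per_second):
    """Time to move one file over a link of the given speed."""
    return size_bytes * 8 / bits_per_second


# Figures taken from the article: 6" x 9" page, 600 dpi 1-bit scan stored
# in about 100 KB, color scan stored in about 5 MB, $2/MB disk, 2400 bps.
bitonal_raw = raw_bytes(6, 9, 600, 1)
color_raw = raw_bytes(6, 9, 300, 24)  # 300 dpi is an assumed value
bitonal_stored = 100 * KB
color_stored = 5 * MB

for label, raw, stored in (
    ("bitonal", bitonal_raw, bitonal_stored),
    ("color", color_raw, color_stored),
):
    print(
        f"{label}: raw {raw / MB:.2f} MB, "
        f"lossless ratio {raw / stored:.1f}:1, "
        f"disk cost ${storage_cost(stored, 2.0):.2f}/page, "
        f"{GB // stored} pages per 1 GB drive, "
        f"{transfer_seconds(stored, 2400) / 60:.1f} min at 2400 bps"
    )
```

Under these assumptions the color page costs roughly fifty times as much to store and to transmit as the bitonal page, which is the gap the paragraph above describes. Changing the assumed color resolution changes the raw size and compression ratio, but not the stored-size comparison, which comes directly from the article's figures.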
Many of the limitations from that time have been overcome. Mass storage costs have plummeted by a factor of 1,000. High-speed networks abound, and many end users have broadband access to the Internet. A number of new file formats and compression schemes have been developed, allowing 600 dpi, 1-bit image files to be compressed by an additional two to ten times over what was previously possible.

Given that the latest major revision of TIFF dates from 1992 and the G4 compression scheme is even older (it became an official recommendation in 1988), it is perfectly valid to question the continuing use of these standards, and whether existing files should be migrated to newer formats. In fact, in view of the above, some may question whether bitonal scanning is still merited, since gray scale and color scanning produce richer, more tonally subtle output.

However, a number of countervailing forces suggest otherwise. For example, our hunger for digital storage has managed to keep up with the decline in its cost. As digital collections grow, that magical time when mass storage will be so cheap it will be "unmetered" has remained elusive. In the realm of lossless compression, important strides in the compression of bitonal images have maintained the gap between their storage requirements and those of gray scale and color images.

Another influence is the desire to bundle multiple images together in the form of complete journal articles, book chapters, and pamphlets. Aggregating images is clearly desirable for certain kinds of publications, but can strain network capacity (and users' patience). Though network hardware is getting faster, increased demand tends to dull the impact of increased capacity, so that attention to efficient use of the resource remains important. In addition, emerging wireless networks often have lower bandwidth than their wired cousins.
Finally, much monograph and journal scanning requires only a high level of content fidelity, not perfect tonal reproduction. Properly executed, bitonal scanning is still quite appropriate for much source material. The Digital Library Federation has endorsed a minimum benchmark for digital reproductions of monographs and serials of 600 dpi 1-bit for black-and-white text, simple line drawings, and descreened halftones.

The Migration Dilemma

Even though bitonal images clearly still have an important role to play, the question of whether such images should continue to be created and stored as TIFF G4s remains. Institutions with large collections of TIFF G4 images face many potential motivations for migrating to newer file formats and compression schemes. They can
Some questions that might be part of an institutional self-assessment of whether to migrate include
Despite getting somewhat long in the tooth, the TIFF file format and G4 compression scheme are still widely used and well supported by scanning hardware and software, as well as by image processing and display applications. So for now, the motivation for most institutions will center on the potential benefits of a new format, rather than fear of loss from continuing use of the old.

Institutional circumstances and priorities will play a big role in weighing the pros and cons of migration. For example, how large is the existing collection of TIFF G4s? How important is it to maintain absolute fidelity to the originals? Is there a commitment to long-term retention, or are the images for temporary use? Are users clamoring for more functionality or better performance? How important is reducing expenses for storage, backup, and network transmission? Does the institution want to make its high-resolution master files available to end users (many do not)? We cannot answer these questions for you, but we can lay some of the groundwork to help you start considering a few of the motivating factors and potential consequences.

Migration Considerations

Retrospective conversion vs. going forward

If a decision is made to move to a new format, will all existing images be converted to it, or just newly created ones? The former situation requires very careful consideration, such as whether existing metadata internal to the file will transfer and continue to function. TIFF files offer only a rudimentary header, but some of the newer formats have even less to offer. Institutions must also examine the impact of a wholesale migration on existing systems and processes and on the user population. Some level of disruption and inconvenience is inevitable.

Applying a new format only to new material has its own drawbacks, since it creates a divided collection and the need to maintain additional formats. It may also mean that part of the collection has different functionality than the rest.
Will users be able to make sense of that?

Master vs. access versions

Will the change apply to master files, access files, or both? One of the potential benefits of migrating is the ability to use the same file as master and access version. Most users now have hardware and software appropriate to handle high-resolution images. With sufficient compression, and the availability of a file viewer on the user's computer that handles scaling, zooming, gray enhancement [1], etc., one can contemplate using 600 dpi bitonal files for access.

However, many non-technical issues enter into decisions about master files, such as whether any degree of lossy compression can be tolerated, or whether the format is proprietary. Such dual use may be possible only for files that are not considered long-term institutional assets and that can be altered slightly without losing their value. Also, some institutions consider their high-resolution scans economic assets or may even be prohibited by copyright or contractual arrangements from making them available to end users.

Using a new format solely for access, especially if that new format must be maintained online rather than created on the fly, may lessen the appeal of migration. For example, potential storage savings would probably not be realized. In some cases, migration for access only may be desirable if substantial improvement to the users' experience is the primary motivation.

Single pages vs. bundles

Bundling can be a component of migrating to a new format. For instance, although it doesn't reduce their size significantly, individual TIFF G4s can be bundled into multipage PDF G4s, making them directly accessible to most Web users, and automatically gaining all the navigation and display control offered by Adobe's Acrobat Reader.

Though an image collection may lend itself to bundling, there may not be any one correct way to do it. Should journal images be bundled by article, by issue, or by volume?
Viewing one page at a time may be constraining, but so may having pages bundled in what seems like too small, too large, or just the wrong configuration. It is also possible to create page bundles on the fly, giving users the option to customize.

To be continued…

Several image file formats and compression schemes, some old and some quite new, are potential migration targets for existing TIFF G4 files. In Part II of this FAQ, to be published in the April issue, we'll provide a brief rundown of some of the most important alternatives and examine some of their pros and cons. In Part III, to be published in the June issue, we'll take a closer look at specific implementations, including features and performance.

Richard Entlich

Footnote
Publishing Information

RLG DigiNews (ISSN 1093-5371) is a newsletter conceived by the members of the Research Libraries Group's PRESERV community. Funded in part by the Council on Library and Information Resources (CLIR) 1998-2000, it is available internationally via the RLG PRESERV Web site. It will be published six times in 2003.

Materials contained in RLG DigiNews are subject to copyright and other proprietary rights. Permission is hereby given for the material in RLG DigiNews to be used for research purposes or private study. RLG asks that you observe the following conditions: Please cite the individual author and RLG DigiNews (please cite URL of the article) when using the material; please contact Jennifer Hartzell, RLG Corporate Communications, when citing RLG DigiNews. Any use other than for research or private study of these materials requires prior written authorization from RLG, Inc. and/or the author of the article.

RLG DigiNews is produced for the Research Libraries Group, Inc. (RLG) by the staff of the Department of Research, Cornell University Library. Co-Editors, Anne R. Kenney and Nancy Y. McGovern; Production Editor, Martha Crowe; Associate Editor, Robin Dale (RLG); Technical Researchers, Richard Entlich and Peter Botticelli; Technical Coordinator, Carla DeMello; Technical Assistant, Valerie Jacoski.

All links in this issue were confirmed accurate as of February 15, 2003. Please send your comments and questions to RLG DigiNews Editorial Staff.