April 15, 2003, Volume 7, Number 2
ISSN 1093-5371

 

FAQ

Squeezing More Life Out of Bitonal Files: A Study of Black and White. Part II.

Your editor's interview in the December 2002 RLG DigiNews states that JPEG 2000 can save space and replace the multitude of file formats used for conversion and display of cultural heritage images but that it isn't suitable for bitonal material. We have lots of bitonal images. Is there anything similar available for them?

In part I of this FAQ we examined the rationale for bitonal scanning going back to 1990 and reaffirmed its continuing relevance for digital capture of certain types of cultural heritage materials. We also considered the potential advantages and disadvantages of migrating collections away from the popular but aging TIFF G4 bitonal imaging standard. Here in part II, we'll take a first look at some of the alternative bitonal file formats and compression schemes. Part III, to appear in the June 2003 issue of RLG DigiNews, will compare the quality and performance of some specific products on a range of document content, including text, halftones, and complex graphics.

The Contenders

Several image file formats and compression schemes are potential migration targets for existing TIFF G4 files. Here's a rundown of some of the most important options, presented in alphabetical order.
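Before weighing migration targets, it can help to confirm which files in a collection really are G4-compressed. The following is a minimal sketch, not part of the original article, that reads the Compression tag (tag 259, where the value 4 denotes CCITT Group 4) from the first image directory of a TIFF file. The function names are hypothetical, only the first page of a multi-page TIFF is examined, and a production inventory would more likely rely on an established imaging library.

```python
import struct

COMPRESSION_TAG = 259   # TIFF "Compression" tag
CCITT_GROUP4 = 4        # tag value for CCITT Group 4 (G4) fax encoding


def tiff_compression(data: bytes) -> int:
    """Return the Compression value of the first image directory in a TIFF.

    Hypothetical helper for illustration; only the first IFD is inspected.
    """
    if len(data) < 8:
        raise ValueError("too short to be a TIFF file")
    if data[:2] == b"II":
        endian = "<"    # little-endian byte order
    elif data[:2] == b"MM":
        endian = ">"    # big-endian byte order
    else:
        raise ValueError("missing TIFF byte-order mark")
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("missing TIFF magic number")
    (entry_count,) = struct.unpack_from(endian + "H", data, ifd_offset)
    for i in range(entry_count):
        entry = ifd_offset + 2 + 12 * i   # each directory entry is 12 bytes
        (tag,) = struct.unpack_from(endian + "H", data, entry)
        if tag == COMPRESSION_TAG:
            # A SHORT value sits in the first two bytes of the value field.
            (value,) = struct.unpack_from(endian + "H", data, entry + 8)
            return value
    return 1   # tag absent: the TIFF default is 1 (no compression)


def is_tiff_g4(data: bytes) -> bool:
    """True if the first image in the TIFF is stored with G4 compression."""
    return tiff_compression(data) == CCITT_GROUP4
```

Calling is_tiff_g4 on the bytes of each file (for example, the result of open(path, "rb").read()) would give a rough count of how much of a collection is actually a candidate for migration from G4.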

CPC (Cartesian Perceptual Compression)

Overview. Patented in 1991 by Cartesian Products, Inc., CPC is a proprietary compression scheme and image file format for bitonal images. Cartesian Products claims that CPC can compress substantially better than G4 and, though particularly well suited for text, that it outperforms G4 for all kinds of document content, including halftones. Unique amongst the technologies presented here, CPC does not have a lossless mode. Cartesian Products calls its method "nondegrading," meaning that the original file can no longer be restored exactly after conversion to CPC, but that the differences cannot be perceived by the human eye (other vendors use the terms "visually lossless" and "perceptually lossless" for the same concept).

Advantages. CPC is a proven technology that has been adopted for some large bitonal image collections, such as JSTOR, which converted all its online journal holdings to CPC in 1997[1]. A list of major CPC users is available online. Cartesian's claims of nondegraded compression have been verified by user preference tests conducted by ISO. CPC supports single- and multi-page documents. Its viewer is available for all major platforms.

Disadvantages. CPC is proprietary, though Cartesian Products makes available APIs (application programming interfaces) to facilitate development of software using the scheme. Cartesian also claims that it is “working with a number of vendors who will be releasing CPC-enabled products, encompassing a broad range of applications including Internet fax services, document distribution, educational assistance, and electronic libraries.” However, at the moment Cartesian is the sole source of CPC encoders and viewers. CPC's lack of a true lossless mode could be an issue for demanding preservation applications. CPC is only for bitonal images. The format is not Web native and requires the installation of a special viewer if the CPC files are to be used for display purposes. CPC offers limited metadata capacity.
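The difference between a true lossless mode and a "nondegrading" one can be made concrete by comparing the original bitmap with the one decoded after conversion. The sketch below is an illustration only. It assumes both images have already been decoded into packed one-bit-per-pixel buffers of identical dimensions (decoding G4 or CPC is outside its scope), and the function names are hypothetical.

```python
def differing_pixels(original: bytes, decoded: bytes) -> int:
    """Count the pixels that differ between two packed 1-bit bitmaps.

    Assumes both buffers hold the same image dimensions, 8 pixels per byte.
    """
    if len(original) != len(decoded):
        raise ValueError("bitmaps must have the same size")
    # XOR leaves a 1 bit wherever the two bitmaps disagree.
    return sum(bin(a ^ b).count("1") for a, b in zip(original, decoded))


def is_lossless(original: bytes, decoded: bytes) -> bool:
    """True only if every pixel survived the round trip unchanged."""
    return differing_pixels(original, decoded) == 0
```

A lossless scheme must yield a count of zero. A nonzero count shows only that pixels changed, not whether the change is visible; judging that requires viewing tests of the kind mentioned above.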

DjVu

Overview. DjVu (pronounced like déjà vu) was developed by AT&T Labs in 1996 with the first publicly released products coming in 1998. DjVu is designed to be a comprehensive, all-in-one document solution, suitable for bitonal text as well as gray scale and color content. DjVu defines a document format and encompasses several different compression schemes. A layering scheme allows documents that combine text and continuous tone content to treat each separately for optimal compression and display. AT&T Labs sold the rights to DjVu to LizardTech, Inc. in 2000. The independent PlanetDjVu Web site is an excellent source of information on all things DjVu.

Advantages. Supports lossy (claimed visually lossless) and true lossless compression of bitonal images, both claimed to compress considerably better than G4. Also handles gray scale and color. Viewer is available for all major platforms. Handles single- and multi-page documents. In December 2001 LizardTech released part of the v3.0 DjVu Reference Library as open source, and others have since enhanced that library.

Disadvantages. DjVu is proprietary, though LizardTech makes available SDKs (software developer kits) to facilitate development of software for encoding and decoding. LizardTech will license the DjVu Reference Library only for noncommercial use. At this time LizardTech is the sole source of commercial DjVu products that adhere to the current standard. Though DjVu clearly has some very enthusiastic supporters, its adoption has been spotty. The DjVu Zone Web site (which has not been updated in over a year) includes an outdated list of current users. Two of the largest users cited, Heritage Microfilm's Historical Newspaper Archive and UMI's Early English Books Online, have abandoned display of DjVu images in favor of PDF and GIF, respectively. DjVu also offers limited metadata capacity. The format is not Web native and requires the installation of a browser plug-in for display purposes.

JBIG2

Overview. Developed by the Joint Bi-level Image Experts Group, JBIG2 is a compression scheme for bitonal images that became an ISO standard at the end of 2001. It is the only contender mentioned here that is an international standard. According to the introduction of the draft JBIG2 standard, "the design goal for JBIG2 was to allow for lossless compression performance better than that of the existing standards, and to allow for lossy compression at much higher compression ratios than the lossless ratios of the existing standards, with almost no visible degradation of quality." Because the standard is relatively new, it is only now starting to appear in commercial products.

Advantages. Nonproprietary. Supports both lossy and lossless compression of bitonal images, including a special mode for halftones. Considerably better compression than G4, especially for halftone images. Can theoretically be incorporated into several existing file formats, such as TIFF and PDF.

Disadvantages. JBIG2 is strictly a compression scheme, so it is up to developers to incorporate it into existing file formats. Certain functionality, such as metadata, depends on what file format is used. Some applications can now produce JBIG2-encoded PDFs, but only Acrobat Reader 5.0 and later can decode them, potentially limiting user access. An open source decoder is being worked on but is in the very early stages of development.

PDF

Overview. Adobe's PDF is itself neither an image file format nor an image compression scheme. However, PDF can serve as a container for digital images compressed with several schemes, including G4 and JBIG2. PDF has been around since 1993 and has evolved over the years into the leading format for online distribution of complex documents.

Advantages. PDF supports single- or multi-page documents. For example, although doing so doesn't reduce their size significantly, individual TIFF G4s can be bundled into multi-page PDF G4s, making them directly accessible to most Web users and automatically gaining all the navigation and display control offered by Acrobat Reader. Though PDF is proprietary, Adobe has maintained it as an open specification, resulting in a substantial level of third-party support. A free viewer is available for all major computing platforms. PDF is well established and is now the subject of a fledgling effort called PDF/A to "develop an international standard that defines the use of the Portable Document Format (PDF) for archiving and preserving documents." See also "Archiving and Preserving PDF Files," by John Mark Ockerbloom, in the February 2001 issue of RLG DigiNews.

Disadvantages. Despite the open specification, PDF is still a proprietary format. Acrobat Reader can decode JBIG2-encoded images, but only since version 5.0. Users who haven't upgraded to version 5 will get an error message if they attempt to read a JBIG2-encoded PDF. There are hundreds of tools for converting to PDF, but they must be evaluated carefully since there is considerable variation in the quality of output and efficiency of display. PDF needs better metadata capability (PDF/A may address this). The format is not Web native and requires the installation of a special viewer, though Acrobat Reader and the Acrobat browser plug-in are widely deployed.
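Because older versions of Acrobat Reader cannot open JBIG2-encoded PDFs, it may be worth flagging such files before putting them online. The following is a rough sketch under stated assumptions: it scans the raw bytes of a PDF for the JBIG2Decode filter name, which entered the PDF specification with version 1.4 (the version that corresponds to Acrobat 5), and reads the version from the file header. The function names are hypothetical, and a byte scan of this kind can miss filter names that are stored inside compressed object streams or written with escaped characters.

```python
import re


def pdf_header_version(data: bytes):
    """Return (major, minor) from the %PDF-x.y header, or None if not found."""
    # Readers tolerate a little leading junk, so search the first 1024 bytes.
    match = re.search(rb"%PDF-(\d)\.(\d)", data[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def mentions_jbig2(data: bytes) -> bool:
    """Heuristic: True if the JBIG2Decode filter name appears in the raw bytes."""
    return b"/JBIG2Decode" in data


def may_need_acrobat5(data: bytes) -> bool:
    """Flag files that older readers would probably fail to display."""
    return mentions_jbig2(data)
```

Running may_need_acrobat5 over a batch of PDFs gives a first-pass list of files that users of older readers may not be able to open. The header version alone is a weaker signal, since a file can declare version 1.4 without using JBIG2 at all.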

To be continued …

Are any of these migration targets for existing TIFF G4 images appropriate for your collection? Much will depend on institutional circumstances and priorities and the nature of the documents in question. In part III of this FAQ, to be published in the next issue of RLG DigiNews, we'll look at some specific implementations and reassess migration risk in light of all our findings.

Richard Entlich

Footnote

[1] JSTOR still scans to TIFF G4 and considers those files its preservation masters. The CPC files are used to reduce storage requirements for its online collection but are converted to GIF for display and to PDF for printing.


Publishing Information

RLG DigiNews (ISSN 1093-5371) is a Web-based newsletter conceived by the RLG preservation community and developed to serve a broad readership around the world. It is produced by staff in the Department of Research, Cornell University Library, in consultation with RLG and is published six times a year at www.rlg.org.

Materials in RLG DigiNews are subject to copyright and other proprietary rights. Permission is hereby given to use material found here for research purposes or private study. When citing RLG DigiNews, include the article title and author referenced plus "RLG DigiNews, ." Any uses other than for research or private study require written permission from RLG and/or the author of the article. To receive this, and prior to using RLG DigiNews contents in any presentations or materials you share with others, please contact Jennifer Hartzell (jlh@notes.rlg.org), RLG Corporate Communications.

Please send comments and questions about this or other issues to the RLG DigiNews editors.

Co-Editors: Anne R. Kenney and Nancy Y. McGovern; Associate Editor: Robin Dale (RLG); Technical Researcher: Richard Entlich; Contributor: Erica Olsen; Copy Editor: Martha Crowe; Production Coordinator: Carla DeMello; Assistant: Valerie Jacoski.

All links in this issue were confirmed accurate as of April 15, 2003.


   
 