RLG DigiNews
December 15, 2002, Volume 6, Number 6
ISSN 1093-5371


Table of Contents

Editors' Interview
JPEG 2000, an interview with Dr. Daniel Lee

Feature Article 1
From Oral Tradition to Digital Collectives: Information Access and Technology in Contemporary Native American Culture, Kari R. Smith

Highlighted Web Site
The Diffuse Project Home Page

FAQ
In Search of Lost Pages: Stemming the Tide of Broken Links, by Richard Entlich

Calendar of Events

Announcements

RLG News

 



Editors' Interview

JPEG 2000

Dr. Daniel Lee
ISO SC29/WG1 (JPEG)
dlee@yahoo-inc.com

Editors’ Note
This interview is with Dr. Daniel Lee, who currently convenes ISO’s JPEG group. The ISO SC29/WG1 committee covers JPEG and JBIG. If you are not familiar with the standard, you may want to visit the JPEG 2000 Web site for a brief overview of its content and links to related information. The JPEG 2000 Source site provides additional background on who is involved in its development. We anticipate including items in future issues on the implications of implementing this standard and on related topics.

Could you briefly describe JPEG 2000? Why should our readers pay attention to JPEG 2000? Who have the major players been in the development of JPEG 2000? What standards bodies are involved? 

JPEG 2000 is a new image coding standard developed by the International Organization for Standardization (ISO) that serves the needs of imaging applications that the original JPEG standard does not meet. It is also an International Telecommunication Union (ITU) standard, so it has both an ISO and an ITU number. It is hence designated as the ISO/IEC 15444 and ITU-T T.800 series of standards.

JPEG 2000 is important because it is the cumulative work of over 100 imaging experts from 18 countries and represents state-of-the-art development in the image coding community. The players include leaders in the digital imaging industry, leading academic and industry research institutions, and government research institutions and agencies. 

How are JPEG 2000 software or vendor products certified as JPEG 2000 compliant?

Part 4 of the JPEG 2000 standard (ISO 15444-4) deals with conformance testing; it defines the procedures for testing whether JPEG 2000 implementations comply with the standard.

How well suited is JPEG 2000 for the range of cultural objects (e.g., texts, manuscripts, photographs, art objects, etc.) being digitized by libraries, archives, and museums today? Is it, for example, a good choice for text and other image data that are characterized by edge detail? Is it a good alternative to Group 4 for 1-bit data?

One of the design requirements for JPEG 2000 when it was first conceived was that the standard handle a wide variety of images, including those used to capture cultural objects like texts, manuscripts, photographs, art objects, etc. The technology adopted by JPEG 2000 lends itself to handling a wide range of image types, including image data that are characterized by edge detail. It is not meant to be a replacement for the Group 4 facsimile standard, which is designed specifically for binary images (e.g., black-and-white texts). Even for those kinds of data, I would refer readers to a newer and much more powerful standard, JBIG2, which was also developed under the same ISO standardization body that developed JPEG 2000.

JPEG 2000 offers many features not provided by the current JPEG standard, including support for multiple resolutions, tiling, and region-of-interest coding. Can you describe the advantages of these features and what kinds of cultural heritage images might benefit most from their use?

There are indeed many new features in JPEG 2000 that offer advantages over the current JPEG standard. In the context of cultural heritage images, support of multiple resolutions enables very effective image archive applications, so that a person viewing a particular spatial region of the image can extract (view) and drill down on the details of that region without spending computing and networking resources to decode the entire image.

Tiling support enables the encoding of very large images at extremely fine resolution, which goes beyond what the current JPEG standard can handle. Region-of-interest coding enables the optimal use of encoding and networking resources to preserve certain regions of image data with as much detail (resolution) as intended without assigning the same resources to the entire image.
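To make the tiling and multiple-resolution ideas above concrete, here is a minimal Python sketch of the bookkeeping a viewer might do before decoding anything. It is an illustration under simplifying assumptions, not code from the standard or from any JPEG 2000 library: the function names `level_size` and `tiles_for_region` are hypothetical, the tile grid is assumed to start at the image origin, and each discarded resolution level is assumed to halve both dimensions, rounding up.

```python
def level_size(width, height, levels_dropped):
    """Size of the image when `levels_dropped` resolution levels are skipped.

    Assumption: each skipped level halves both dimensions, rounding up.
    """
    scale = 1 << levels_dropped
    # -(-a // b) is integer ceiling division.
    return (-(-width // scale), -(-height // scale))


def tiles_for_region(width, height, tile_w, tile_h, x0, y0, x1, y1):
    """Return (column, row) indices of tiles overlapping a requested region.

    The region is the half-open rectangle [x0, x1) x [y0, y1) in full-resolution
    pixel coordinates. Assumption: the tile grid starts at pixel (0, 0).
    """
    # Clamp the request to the image bounds.
    x0, y0 = max(0, x0), max(0, y0)
    x1, y1 = min(width, x1), min(height, y1)
    if x0 >= x1 or y0 >= y1:
        return []  # The request lies entirely outside the image.

    first_col, last_col = x0 // tile_w, (x1 - 1) // tile_w
    first_row, last_row = y0 // tile_h, (y1 - 1) // tile_h
    return [
        (col, row)
        for row in range(first_row, last_row + 1)
        for col in range(first_col, last_col + 1)
    ]


# A 100 x 100 pixel detail request against a 4096 x 4096 image cut into
# 1024 x 1024 tiles touches only the tiles listed here, not all 16.
needed = tiles_for_region(4096, 4096, 1024, 1024, 1000, 1000, 1100, 1100)
thumbnail = level_size(4096, 4096, 3)
```

In this example the detail request overlaps 4 of the 16 tiles, so a decoder organized this way would need to touch only a quarter of the tile data, and a thumbnail could be produced from the lowest resolution levels alone.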

How good is JPEG 2000's compression efficiency, and how well does it retain image quality with increasing compression? How do these features compare with JPEG and other common image file formats? Does JPEG 2000 offer both lossless and lossy compression modes? Does its compression work well with various bit-depth files? 

JPEG 2000 is designed to achieve excellent compression efficiency by using very advanced image coding techniques. It is particularly designed to retain the best image quality at each compression level. In technical terms, this is called rate-distortion optimization. Compared to current JPEG and other common image file formats, we have seen compression efficiency improve by anywhere from 30% to 60%, bearing in mind that compression efficiency is image dependent.

JPEG 2000 not only offers lossless and lossy compression modes; it is also uniquely designed so that an application can move continuously from a high-compression (lossy) mode down to lossless mode.

JPEG 2000 works well at a wide range of image bit depths, as this is also one of the design features of the standard.
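One way to see how a single scheme can span lossy and lossless modes is the reversible integer wavelet step that, as far as we understand it, JPEG 2000 Part 1 uses for lossless coding (the 5/3 lifting filter). The Python sketch below is a simplified, one-dimensional illustration only: the function names `forward_53` and `inverse_53` are hypothetical, the input is assumed to be an even-length list of integers, and a real codec applies the transform in two dimensions over several levels and then entropy-codes the result.

```python
def forward_53(samples):
    """One level of a reversible 5/3 lifting transform on a 1-D integer signal.

    Returns (low, high): a half-length coarse approximation and the detail
    coefficients needed to restore the input exactly.
    Assumption: len(samples) is even and at least 2.
    """
    n = len(samples)
    if n < 2 or n % 2:
        raise ValueError("this sketch expects an even-length signal")
    even, odd = samples[0::2], samples[1::2]
    half = n // 2

    # Predict step: each odd sample minus the floor-average of its even neighbours.
    high = []
    for i in range(half):
        right = even[i + 1] if i + 1 < half else even[i]  # mirror at the edge
        high.append(odd[i] - ((even[i] + right) >> 1))

    # Update step: adjust even samples using the neighbouring details.
    low = []
    for i in range(half):
        left = high[i - 1] if i > 0 else high[0]  # mirror at the edge
        low.append(even[i] + ((left + high[i] + 2) >> 2))
    return low, high


def inverse_53(low, high):
    """Undo forward_53 exactly by running the lifting steps in reverse order."""
    half = len(low)
    even = []
    for i in range(half):
        left = high[i - 1] if i > 0 else high[0]
        even.append(low[i] - ((left + high[i] + 2) >> 2))

    samples = []
    for i in range(half):
        right = even[i + 1] if i + 1 < half else even[i]
        samples.append(even[i])
        samples.append(high[i] + ((even[i] + right) >> 1))
    return samples


samples = [10, 12, 14, 200, 13, 11, 9, 8]
low, high = forward_53(samples)
assert inverse_53(low, high) == samples  # keeping every coefficient is lossless

# Discarding the detail coefficients gives a coarser, lossy reconstruction.
approximation = inverse_53(low, [0] * len(high))
```

Because every step uses only integer additions and shifts, keeping all coefficients restores the input bit for bit, while dropping or coarsening detail coefficients trades fidelity for size. That is the sense in which one code stream can range from highly compressed to lossless.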

JPEG 2000 is still a work in progress. Some parts are incomplete, and those sections that have been completed have been subject to several technical corrections and amendments. Is JPEG 2000 ready for still-image users, who require stability and contemplate a long-term commitment to their images?

JPEG 2000 Part 1: Core System has been published for over a year now. Several minor technical corrections and amendments have since been published. It is absolutely ready for use. As an official ISO (jointly with ITU) standard, it is subject to a rigorous maintenance process to ensure that the standard is well maintained over its lifetime. The standard as designed is expected to serve image applications for at least the next 10 years.

A growing number of cultural heritage institutions have committed to the long-term maintenance of large image collections. Before migrating to another format, they will want assurance that the new format offers compelling functionality to merit the effort and expense of switching. A number of file formats (e.g., MrSID, DjVu, LuraWave) offering wavelet compression and some of the same features as JPEG 2000 have appeared in recent years. Is there anything about JPEG 2000 that distinguishes it from its competitors and deserves special consideration as a replacement for these?

JPEG 2000 is based on the most advanced image coding system. The technology that is incorporated into the standard has gone through rigorous testing to ensure that the performance goals are met. It has a rich set of features that can be considered tools that image applications can apply to provide the best solutions for a wide range of imaging needs. The standard has a set of compliance measures to ensure compatibility and interoperability. Hence it is the “tried and true” image standard. Indeed, many vendors are adopting JPEG 2000 and converting their proprietary formats to JPEG 2000.

Many cultural heritage institutions store their master image files in TIFF format, while using JPEGs to deliver the same images over the Web. Does JPEG 2000 have the potential to offer all these features in a single format and obviate the need to maintain two sets of files for each image? Should cultural heritage institutions consider using JPEG 2000 as a single preservation and delivery format?

JPEG 2000 offers a very comprehensive set of features in a single format and obviates the need to maintain two sets of files for each image. I definitely recommend that cultural heritage institutions use JPEG 2000 as a single preservation and delivery format.

What kinds of tools are available right now for the creation, manipulation, and delivery of JPEG 2000 images?

A variety of software tools, from many countries, are now available for the creation, manipulation, and delivery of JPEG 2000 images. These tools include encoding, decoding, file conversion, and online image manipulation tools. A search of the World Wide Web will show where one can find these tools.

Some examples:

Leadtools JPEG2000 Plugin

LuraWave.jp2 Browser Plug-in

LuraWave.jp2 Photoshop Plug-In

j2k by Fnordware

ImagePress JPEG2000™ Photoshop Plug-in

Elysium, ltd. JPEG 2000 plugin

At the Image Compression Symposium, Steve Kerr and Bernie Brower [see resources listed at end] suggested that: “JPEG 2000 is more than a change in compression.  It is a paradigm shift in how we collect, store, transmit, and use image information.” How would you respond to that statement?

Yes, this is indeed a shift from the traditional image coding paradigm present when the original JPEG was developed. In a distributed world, where there is a high level of connectivity, it is a paradigm in which images and the resources to support the imaging applications are distributed. One no longer deals with sending and receiving images from just one source to another. Images can be coming from multiple sources, and the resources to manipulate, process, and perform applications are also coming from multiple sources. We need a new approach to handling image data so as to accommodate future applications in this new paradigm.

Could you describe ISO's overarching strategy for using scalable approaches for archiving and delivering multimedia? How does JPEG 2000 fit into that strategy?

JPEG 2000 has been designed with scalability as the top requirement: multiple resolutions, spatial scalability, quality scalability (so-called signal-to-noise scalability), bit-stream scalability, and file format and metadata scalability. JPEG 2000 is the building block for archiving and delivery of visual media in the overall ISO standardization activity.

JPEG 2000 offers numerous features not available in JPEG. Yet experience has shown that technological superiority is not necessarily sufficient when it comes to acceptance of new formats. For example, PNG has numerous advantages over GIF, yet it has failed to win large numbers of converts. Other potential formats, such as Flashpix, simply failed to catch on. What is the JPEG 2000 committee doing to ensure that JPEG 2000 will be more quickly and fully embraced by users and toolmakers?

The JPEG 2000 committee understands that technology alone cannot make a standard pervasive. A standard needs the support of application developers and the user community. To that end, the committee has published Part 5 (ISO 15444-5), the reference software implementation for JPEG 2000, so that developers can use it for application development. Experts from the committee have been actively organizing conferences and seminars and participating in major imaging conferences to promote the development of JPEG 2000. Industry groups such as the I3A have adopted JPEG 2000 as their standard of choice and promote it at trade events and exhibitions. Researchers from academic institutions have published books and articles on JPEG 2000 that reach out to the research community. All of these would certainly help the pervasive adoption of JPEG 2000.

One preservation requirement is to preserve the integrity of the original file (object) over time. A number of institutions have invested in creating "libraries" of TIFF RGB masters. What would happen in a TIFF RGB to JPEG 2000 to TIFF RGB transformation? Would the output match the input? If not, what would potentially be lost?

Preservation of the image quality is one of the key features in the development of the JPEG 2000 standard. In the JPEG 2000 file format, very detailed considerations have been given to the correctness and the preservation of the Color Space representation/transformations of the image data. This information is encapsulated into the file format so that when the image data are passed among different applications (e.g., from scanning to printing), the fidelity is never lost.

At the moment, decoding a J2K, JP2, or JPX file with a Web browser requires a plug-in. Widespread use and acceptance of JPEG 2000 will probably require high-quality native support in the mainstream browsers. PNG has had native browser support for over five years, yet some of its best features are still not properly supported in popular browsers. Has Microsoft or AOL (Netscape) committed to providing full native support for any of the JPEG 2000 file formats? If so, when do you expect it to appear?

The JPEG 2000 committee has regular contacts with major browser manufacturers and will continue to seek their support in adding JPEG 2000 decoders into the browsers’ native functions.

The official JPEG 2000 Web site indicates that the core coding system is not patent free, though it is intended to be royalty and license-fee free. Should potential users be concerned that future assertion of patent rights may cause problems, such as happened with GIF images?

Throughout the development process of JPEG 2000, the committee has followed the ISO guidelines concerning technology selection. The committee has obtained a generous offer of royalty-free and license-fee-free conditions from its technology contributors. While one can never be certain that there will never be any assertion of patent rights that may cause problems in the future, there should be some degree of confidence that a standard developed by so many experts from so many countries should be safe for users to adopt.

TIFF is widely used by cultural heritage institutions for master image files. The main TIFF specification hasn't been updated in over a decade. This may be a blessing, as regular updates often introduce "feature creep" with resulting incompatibilities. On the other hand, a dead image format will eventually suffer obsolescence. Should users be resigned to the need for periodic migration to new formats? What are the signs that an image format is reaching the end of its life expectancy?

The clear sign that an image format is reaching the end of its life expectancy is that it no longer supports new applications, particularly those that emerge from new paradigms of application environments.

Suggested Resources

"Library Potential Impacts," by Steve Kerr and Bernie Brower, Image Compression Symposium.

"ISO JPEG 2000 Standards Efforts," by Gordon Ferrari, Image Compression Symposium.

"NITFS JPEG 2000 Implementation Schedule/Events," Bandwidth Compression Symposium.

"The Next Generation of Compression JPEG 2000," by Bernie Brower, Image Compression Symposium.




From Oral Tradition to Digital Collectives: Information Access and Technology in Contemporary Native American Culture[1]

Kari R. Smith
Columbia University
krs2002@columbia.edu

For people who may live both physically and culturally distant from the majority culture in their immediate environment, information technology can provide a boost toward accessing and documenting their own heritage. As early adopters of the Web, Native Americans began using the Internet for e-commerce and cultural outreach in the early 1990s. The University of Michigan School of Information (SI), through internships and workshop classes held since 1997, has been exploring ways that digital technology can facilitate appropriate access and greater participation in cultural heritage documentation and presentation in tribal colleges and communities across the United States.

The Cultural Heritage Preservation Institute (CHPI) and its research component, the Digital Collective, were developed by SI Professor Maurita Peterson Holland and the author working with Native American community leaders, educators, cultural experts, and SI graduate students. These efforts culminated in 2001 with an international meeting in Hilo, Hawaii, of indigenous culture and technology specialists; library, museum, and archives professionals; funders; and digital library researchers. At this three-day meeting convened by SI, issues were discussed involving the use of information technology in preserving, documenting, and participating in indigenous cultures.

The Cultural Heritage Preservation Institute

In 1997 a middle-school teacher on the Navajo Nation asked Holland to consider ways that SI could collaborate with K-12 schools and the tribal college to use information technology to enhance cultural education in the classroom. Holland was a key contact at SI, both as the director of the Academic Outreach Programs and as the faculty member instrumental in coordinating the internships on the Navajo Nation. 

The author, who had recently returned from a sixteen-week internship at the Navajo Nation with an interest in developing culturally appropriate uses for information technology, worked with Holland as a graduate research assistant. By 1998 we had developed a plan and program for CHPI, a week-long technology and culture workshop for middle-school students and their teachers. Based on the teacher’s needs and our experiences on the Nation, there were several goals we wanted to achieve, among them:
  • create educational materials about Diné culture in Navajo voice
  • create primary materials by and about contemporary Native American people and their cultural heritage for future use and preservation
  • raise awareness of the role of museums and archives in preserving cultural materials.
Image 1  CHPI Web page

The Institute’s goals were to encourage effective student participation in the information society by providing equipment and technology skills, stimulate interest in a career in information science, encourage use of the Web as a community space, raise the awareness of the tribal college, and encourage the pursuit of higher education by K-12 students. SI acted as institute organizer, technology trainer, and facilitator, with Holland as the project director and the author as the project manager. Using the SI workshop framework, [2] graduate students created technology instructional materials, led discussions on how the projects could fit into classroom teaching, and worked one-on-one with the participants in the creation of the projects. The challenges for SI, as identified in its Final Report and Evaluation, included incorporating information technologies into educational modules for use in a middle-school classroom, bridging the perceived gap between traditional culture and modern technological life (use of information technology to teach about Diné culture), and creating and producing an institute of high cultural integrity and significance for all participants.

CHPI at the Navajo Nation

The June 1998 institute, held in Tsaile, Arizona, at Diné College on the Navajo Nation, was a great success. Twenty-two elementary and middle-school students and their teachers attended the week-long technology and culture workshop. They created educational Web projects and learned technology skills they could share once home. Each student became familiar with Diné College and learned not only about computers, the Internet, and creating Web pages, but also about Diné culture and history. The SI graduate students who taught the technology components gained valuable lessons in working in a challenging IT environment. They also learned about a new nation, culture, and language. What set the workshop apart, and made it successful, was its combined approach of applying cultural education to technology skills. Several of the adult participants commented that although they had previously taken workshops on creating Web pages, CHPI was the most successful because there was a purpose to learning the technology skills.

Both the teachers and their students received instruction on Internet basics that included browsing, searching, and critical evaluation of Web sites. They also learned how to make basic Web pages, use a Kodak DC210 digital camera, [3] and scan and edit digital images. They developed skills to use the Internet and other digital technology tools to share their heritage with others. By the end of the institute each group of participants created a Web site project based on Diné culture using the information and skills they had acquired.

Image 2   Web page of the 1998 CHPI at the Navajo Nation

The participants toured the Ned A. Hatathli Museum on the college campus, which describes and displays collections of Navajo and other Native American artifacts from a Navajo point of view.  In addition, they were able to experience Diné cultural heritage throughout the week from demonstrations and lectures by Diné artisans (woodcarving, pottery, basketry, and silversmithing). A guided tour of nearby Canyon de Chelly National Monument gave them a chance to learn about the historical and cultural significance of the canyon and enjoy its natural beauty. One night a local astronomer set up a telescope so the participants could see a special astronomical event and explained the stars and constellations in the Navajo sky.

During the final two days of the institute, participants designed and created Web-based projects that were to be the basis for ongoing education and curriculum development of cultural heritage education and community heritage documentation. For example, as part of their project on native plants, students from Kayenta Middle School drew pictures and took digital photographs of actual plants, then discussed native uses and stories of the plants on the Web pages they created.

Image 3  Sage Web page created by Kayenta students at CHPI 1998

On the last day of the institute, each participant took part in a public presentation of individual projects in the Diné College Museum attended by the president of Diné College, faculty, and guests.

CHPI in Michigan’s Upper Peninsula

The second CHPI was held in 1999 in Ann Arbor and in the Upper Peninsula of Michigan. Incorporating feedback and lessons learned from 1998, the institute was presented as two workshops. The first focused on digital technology and the Internet and was held at SI.  The second, held in the Upper Peninsula in late June, focused on learning about and documenting Ojibwa cultural heritage. The participants, in the eighth through twelfth grades, were from the Upper Peninsula's Sault Ste. Marie Tribe of Chippewa Indians and the Bay Mills Indian Community.

Image 4  CHPI 1999 Web page

The CHPI participants in 1999 were Internet and Web savvy. The concerns and issues they addressed in their online projects were more about content than process. They were concerned with making sure information available on the Web was culturally sensitive, accurate, and from the point of view of the community and that no sacred, secret, or sensitive information was presented. These intentions were specifically manifested in Web pages created about an ancient tribal burial site visited during the institute field trip. Each person was careful to make sure there were no identifiable landmarks in pictures they took and that there was no information about how to get to the location on their Web pages. Those precautions were very important because ancient burial sites are often targeted by artifact hunters who might use the Web to find such sites. Participants also wanted to use their Ojibwa language and were interested in incorporating audio files into the Web site in the future. Their projects were all based on the theme of “Geography, History, and Culture of Michigan’s Upper Peninsula.”

The CHPI Becomes a Model

After the second institute we wanted to go beyond a project-based approach to documenting culture and adapt our experience gained through CHPI to a more-extensible model for digital access and preservation of culture-based knowledge. As an archivist, the author wanted to build into the CHPI a process to capture, describe, and preserve the digital images, drawings, texts, and new knowledge created during the institutes.

Image 5  Initial model of the Digital Collective

Some of the questions we addressed in our research were:

How can the components of Web pages be reused, and how can records and images that did not make it onto Web pages be kept for later access by a larger community?

How can these materials be used and shared, especially by and for indigenous peoples?

Can a new model for information creation, storage, description, and access be created outside the usual boundaries of archives, libraries, and museums?

Who should be involved in designing and implementing such a system?

What we developed was a process and model called the Digital Collective: a model for storing and accessing shared information and knowledge, as well as for creating new knowledge and recreating global memory, and a place where people share personal and professional information and where they seek connections and build community. The Digital Collective’s power is in the people’s sharing of perspective, recollections, augmentations, and facts in the language of their culture. [4]

In a presentation at the New Information Technologies Conference about digital libraries in China in 2001, we developed five basic principles [5] of the Digital Collective model.

Principle One.  The digital library must be inclusive of all formats and digital instantiations. In so doing, it must adjust its definition to attend to new forms of information that exist only in virtual space.

Principle Two.  Institutions of memory must work together to carry out their responsibility.

Principle Three.  Research about learning will reshape how people use information and create new knowledge.

Principle Four.  The Web offers two-way communication and publication. Every user of information is also a potential producer. Therefore, stakeholders can become collaborators in describing and sharing artifacts and experience.

Principle Five.  Employing principles one through four, we can define a new digital library model: the Digital Collective.

Image 6  Model of the Digital Collective presented in 2001

The Digital Collective is not a digital library, nor a virtual museum, nor an electronic archive. It is a complex system for storing, describing, accessing, and using digitized multiformat materials. Unlike most existing digital libraries or virtual museums that are created and populated by experts from institutions, the Digital Collective is a community space where nonexperts and ordinary people can enter their digital objects along with their information, stories, and experiences about their own or other objects in the collective database.  

The Digital Collective is special because it uses experts to catalog, describe, organize, and produce products from the multimedia objects in a Digital Collective, as well as nonexperts who have personal knowledge and objects and are interested in sharing these, in drawing connections among other objects, and in interacting with others regardless of time or space. Another distinctive benefit of the Digital Collective is its inherent focus on digital objects and therefore born-digital objects. Most institutions of memory that are creating databases for digital access to their collections are digitizing physical objects for greater or remote access.  The Digital Collective model includes born-digital objects created by artists, writers, and musicians, as well as physical objects that have been digitized. 

Another important aspect of the Digital Collective is its attempt to collect current objects and information, rather than wait until materials reach an archive, library, or museum. By allowing people to self-select materials to contribute to a collective and add their own knowledge about materials already in it, the Collective becomes a record of society itself in addition to a rich repository of objects and descriptive information.

Global Discussion

We recognized a need for global discussion about initiatives to document and preserve indigenous culture and extend the conversation about the Digital Collective model. Working with Professor Daniel Atkins at SI, and with funding from the National Science Foundation and further support from the W. K. Kellogg Foundation, in 2001 SI convened a meeting, facilitated by The Grove Consultants, to share ideas and explore connections about the role of information technology in celebrating and extending culture. [6]

Invited attendees of the two-day meeting in Hilo, Hawaii, included Native North Americans (including Alaskan, Hawaiian, and Canadian); Maori, Australian Aboriginals, Sami, and Brazilian and African participants; relevant cultural institutions; academic experts in research and technology; and potential funders. All attendees shared an interest in cultural preservation, education, and sharing through information technology.

Each of the thirty-two attendees was asked five key questions before arriving at the meeting:

  1. What are the major issues for indigenous people in creating and accessing digital resources?
  2. Can we define and develop global digital collectives using collaborative technology that will enhance sharing, stimulate knowledge creation, and provide venues for research?
  3. In what ways can digital collectives and collaboratory spaces educate and preserve culture?
  4. How can digital initiatives leverage job creation, training, and, therefore, economic development?
  5. What are the appropriate roles for institutions of social memory, and how might they work together?

The discussions at the meeting centered on the topic “The Use of the Web in Indigenous Communities.” Based on initial discussion, three threads of interest were identified: preservation, technology, and networking. Participants illustrated their vision of the effect that the application of information technology could have on indigenous communities. Following that, a discussion of themes and threads drew together the critical points.  

Image 7  Hilo participants discuss their vision statements during the meeting

On the second day, the participants shared “what I can do” and “what I have to offer.” This led to defining major topics for discussions in working groups: 1) developing a guide for indigenous communities that want to use the Web and 2) building a global indigenous library.[7]

Several successful collaborations have taken place since the meeting. In addition, there have been discussions at funding agencies about developing new approaches to support indigenous projects. Work is continuing at SI and other institutions in applying the Digital Collective model for cultural preservation and access.

Personal Reflections

The experience reported in this paper began for the author in summer 1997 with a sixteen-week SI-sponsored internship on the Navajo Nation at Navajo Community College—now Diné College—in Tsaile, Arizona. [8]

Image 8 Aerial view of the Diné College campus in 1997

In addition to working on various information management and archives projects at the college and with the Navajo Nation Library System in Window Rock, Arizona, I had ample opportunity to travel around the twenty-six-thousand-square-mile reservation to learn about and visit other cultural collections, museums, and archives.

A few of the places that impressed me in terms of their rich collections and potential were the Tuba City Museum, the Window Rock Tribal Museum and Library, and the Burger King in Kayenta with its WWII Code Talkers exhibit of original artifacts. This exhibit is perhaps the richest private collection on permanent display of WWII Navajo Code Talkers. The fact that it’s located in a fast-food restaurant means that a fair number of people visiting Monument Valley see it, but only people who visit Kayenta have this opportunity. In addition, there were various exhibitions and collections in trading posts across the Nation. Through conversations with people in many towns and trading posts, I was introduced to many rich depositories of information and cultural resources, as well as varying levels of technology used in the storage, organization, and dissemination of those resources. By the time I left the Nation, it was apparent that a directory of some sort would be useful for locating cultural collections across the Navajo Nation.

Image 9  The Tribal Museum and Library in Window Rock, Arizona

I was fortunate that my internship was coordinated by Professor Holland. Some of the other SI projects that the author and Holland worked on in collaboration with tribal colleges in the U.S. include:

"Alternative Spring Break" library and archival graduate students at Native American Educational Services College in Chicago. Fourteen students spent a week processing archival materials and documenting and advising on information technology for the college campus.

School of Information and the American Indian Higher Education Consortium.  Internet public library technology customized for and by each tribal college. 

Crow Nation, Montana—Little Big Horn College. "Alternative Spring Break" and archival internships at the Tribal College Library and Archives.

Lessons Learned

During the four years that I worked with Native and indigenous communities, libraries, archives, and tribal colleges, I learned several valuable lessons. 

Listen: Be a “guide on the side,” rather than “sage on the stage,” and don’t assume “you” know “it.”

Communities come first. This is a very important criterion that is sometimes difficult to keep in focus when working on a research or pilot project.

Understand funding administrative requirements and their potential impact. Who will be granted the funds, and who will be the subcontractor for collaborative projects? What perception might this give the local community? Are the deadlines and reporting expectations realistic for the partners?

Work with Native timetables. Be realistic about how much time is needed to complete a project. Be flexible when unexpected circumstances arise. 

Know the technology constraints and capabilities of partnering institutions and locations.

Be clear about what you can provide and how you are an expert—and how you are not!

Become a part of the community in which you are working, including learning their history, customs, and expectations.

Future Challenges for the Profession

Indigenous people must be included in developing standards and technologies at national and international levels.

Education and training issues, such as funding, the ability to enter established programs, and time commitments, need to be addressed. Increased access to indigenous knowledge has to be weighed carefully against the potential harvesting of that knowledge for commercial use.

Archives, libraries, and museums need to understand what they have in their collections, what ownership rights exist, and how and when the Native American Graves Protection and Repatriation Act (NAGPRA) may apply.


[1] This paper is modified from a presentation for the panel "Creating Web Access to the Cultural Record" at the Society of American Archivists Annual Conference in Birmingham, Alabama, in August 2002. Special thanks are due to Professor Maurita P. Holland, teacher, mentor, and colleague, for all her support. Much of this work was supported by funding from the W. K. Kellogg Foundation, the Microsoft Foundation, and the National Science Foundation, in addition to the School of Information at the University of Michigan in Ann Arbor, Michigan. [back]

[2] The W. K. Kellogg Foundation granted SI funds to re-invent the education of library schools for the twenty-first century. Part of this effort is the Practical Engagement Program (PEP), an integral part of SI's professional master's program. Designed to integrate the application of knowledge and skills to specific problems outside the classroom, PEP both enables and requires students to combine what they have learned in the classroom with what they observe and experience in the "real world." [back]

[3] Digital cameras and software for each school were provided through a grant from the W. K. Kellogg Foundation. [back]

[4] Holland, Maurita P., and Kari R. Smith. "Using Information Technology to Preserve and Sustain Cultural Heritage: The Digital Collective," in the 2000 UNESCO World Culture Report, (UNESCO Publishing: Paris, 2000). [back]

[5] Holland, Maurita P., and Kari R. Smith, "What the Digital Library Doesn’t Tell You: Exploring the Gaps and Opportunities," p. 89 in Global Digital Library Development in the New Millennium, ed. by Ching-chih Chen (Beijing: Tsinghua University Press, 2001). [back]

[6] The meeting was cosponsored by the American Indian Higher Education Consortium and the Smithsonian Institution Museum of the American Indian. [back]

[7] Digital Collectives in Indigenous Cultures and Communities, report of the meeting held August 10 and 11, 2001, in Hilo, Hawaii. Copies of the report may be requested from: School of Information, Office of Academic Outreach, University of Michigan, 304 West Hall, 550 East University Avenue, Ann Arbor, Michigan 48109-1092. [back]

[8] Diné College was established in 1968 as the first tribally controlled community college in the United States. In creating an institution of higher education, the Navajo Nation sought to encourage Navajo youth to become contributing members of the Navajo Nation and world society. [back]



Highlighted Web Site

The Diffuse Project Home Page

This is the Web site for the Diffuse Project, a European effort, coordinated by TIEKE, the Finnish Information Society Development Centre, to provide information on new and existing IT network specifications and standards.

The site offers news on standards development in the form of monthly and quarterly reports. It provides listings (including links) for many of the organizations responsible for IT standards. Also included is a series of informative “Business Guides” on applying standards in specific fields, including image and data compression, trust services, Internet privacy, interactive media, Web services, and virtual reality.

Most importantly, the site maintains a Standards and Specifications List, designed to be a comprehensive guide to IT standards. The list is organized by category, representing network communications, information management, data representation, e-commerce, and data sharing in medical, scientific, geographical, educational, and museum applications. In addition to being searchable, the site includes an alphabetical index of the standards and specifications covered in the Diffuse Project’s publications. As a Web site dedicated to furthering IT standardization, diffuse.org is a valuable reference guide and a source for news on this increasingly important aspect of the Web.




FAQ

Can you recommend any techniques for reducing the incidence of broken external links on my Web site, and for rediscovering resources that have moved?

There is conventional wisdom for both parts of this question. In our response we examine some of the subtleties involved in broken link detection and then demonstrate how well some of the oft-recommended strategies for locating lost Web content really work.

Web site integrity and preservation are increasing concerns in the library and archival communities, and the subject of several research and development efforts. The answer to this question is informed by work on Cornell's Project Prism, an NSF-funded DLI2 initiative, including investigations of integrity measures for external resources, the testing of tools for Web site monitoring, and the development of techniques for locating lost Web pages by utilizing lexical signatures and document similarity software.

Link rot, in which a significant percentage of links no longer work, is a common problem on the Web. It has also been noted that on any site where links are not maintained or updated on a regular basis, the percentage of outdated links rises rapidly over time. For example, in a link analysis of RLG DigiNews conducted for the fifth anniversary issue, the percentage of bad links rose from approximately 10% after one year to nearly 40% after five years.

The causes of link rot are well-known. As sites grow and change, content is often reorganized or moved. In addition, domain names come and go, sometimes disabling all the existing links to a site simultaneously.

A variety of techniques, including HTTP redirects, html redirects ("meta refreshes"), and scripting languages such as JavaScript and PHP can be used to transport users relying on obsolete URLs to a new location, with or without prior warning that the content has moved. Other techniques are available to help stabilize Web content. Persistent addressing schemes such as PURL servers, DOIs (Digital Object Identifiers) and URNs (Uniform Resource Names) allow content to move around on local servers without affecting the functionality of existing links.

Unfortunately, none of these techniques provides a complete or permanent solution to the link rot problem. Redirects are fairly widely used, but are generally kept active for limited periods of time. Once the redirect is removed, all links that haven't been updated will fail. None of the permanent addressing schemes is widely used yet. In either case, the external links on a site are at the mercy of those who maintain the external sites.

Grander solutions have been contemplated. A British startup known as LinkGuard attempted to solve the link rot problem by creating and regularly updating a massive (40 Terabyte) map of all known Web links. The idea was to sell a service that would reference the link map to correct broken links as they were encountered. Unfortunately, the appetite of Web site maintainers to pay for the elimination of broken links proved less voracious than anticipated, and LinkGuard shut down in November 2001, leaving as one of its legacies several hundred bad links to its now defunct linkguard.com domain name.

Until another "big idea" comes along to solve the link rot problem once and for all, more modest techniques must be relied upon. The process requires two basic steps. First, one must be able to detect when a link has gone bad or, even better, when it's about to go bad. Second, once a link is confirmed bad, one needs a mechanism to find a viable link with which to replace it. Ideally, a software tool would automate both steps and require minimal human intervention. Realistically, available tools for automated handling of the second step are still in their infancy and no one technique is suitable for all situations.

Step one: Identifying failed and endangered links

Bad external links can be detected using one of the many available link checker software applications or Web services. Typically the product is pointed at a Web page on your site and directed to check all external links. The resulting report will indicate the status of each link, including 404 errors (page not found), or those that required redirection, e.g., http status codes 301 (moved permanently) and 302 (moved temporarily). Sites that can't be reached at all won't return a status code, and may indicate either a temporary service outage or a more permanent loss of an entire site or domain.
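The status check at the heart of a link checker can be sketched in a few lines. The Python below is an illustration only, not a description of any particular product: the function names classify_status and fetch_status are hypothetical, and the labels simply mirror the categories used in this article (dead, endangered, unreachable).

```python
import http.client
from urllib.parse import urlsplit


def classify_status(status):
    """Map an http status code (None = no response at all) to a link-health label."""
    if status is None:
        return "unreachable"   # temporary outage, or loss of the whole site or domain
    if status in (301, 302):
        return "endangered"    # moved permanently / moved temporarily
    if status == 404:
        return "dead"          # page not found
    if 200 <= status < 300:
        return "ok"
    return "other"


def fetch_status(url, timeout=10):
    """Return the status code for url without following redirects, or None."""
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    conn = conn_cls(parts.netloc, timeout=timeout)
    try:
        # HEAD asks for headers only; a few servers answer it differently from GET.
        conn.request("HEAD", path)
        return conn.getresponse().status
    except (OSError, http.client.HTTPException):
        return None
    finally:
        conn.close()
```

Calling classify_status(fetch_status(url)) for each external link on a page yields the kind of report described above. The http.client module is used because it reports a redirect rather than silently following it.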

However, there are typically more failed or endangered links than a simple link check reveals. Some custom 404 messages improperly report an http status code of 200 (indicating a good page), a phenomenon dubbed "phantom URLs" by Wallace Koehler. [1] Also, redirects that are mediated by meta refreshes or scripts may not be detected by some link checkers. Therefore, a Web monitoring package that can also detect changes in page size or report the presence of particular keywords (such as "moved," "forwarded," "new location," etc.) is desirable.
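As a rough illustration of this kind of content-level monitoring, the Python sketch below scans the text of a retrieved page for a meta refresh tag and for the keywords mentioned above. The function name warning_signs and the simple pattern used to spot a meta refresh are assumptions made for the example; a production monitor would need a real HTML parser and a longer keyword list.

```python
import re

# Keywords taken from the examples in the text; extend as needed.
MOVED_KEYWORDS = ("moved", "forwarded", "new location")

# Loose pattern for <meta http-equiv="refresh" ...>; an assumption, not a full parser.
META_REFRESH = re.compile(r'<meta[^>]+http-equiv\s*=\s*["\']?refresh', re.IGNORECASE)


def warning_signs(html):
    """Return a list of reasons to suspect the page has moved or is a disguised 404."""
    signs = []
    if META_REFRESH.search(html):
        signs.append("meta refresh")
    lowered = html.lower()
    for keyword in MOVED_KEYWORDS:
        if keyword in lowered:
            signs.append("keyword: " + keyword)
    return signs
```

A page that returns status 200 but produces a non-empty list here deserves a manual look, since keyword matches will also flag innocent pages that merely use the word "moved."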

Step two: Safeguarding endangered links and replacing obsolete ones

Consider any redirect returned by an external link on your site to be an opportunity to avoid a future broken link. Most redirects exist on a temporary basis. As soon as you become aware that an external link is being redirected, substitute the new destination URL on your site.

That's pretty straightforward. But what if an object pointed to by an external link suddenly turns up missing, with no warning? A few years ago, the advice for finding the new location (if one indeed existed) would have included the following:

1. "trim" directories off the end of the URL to get a directory listing
2. if the site has an internal search engine, search for your document there
3. conduct a search on a general purpose search engine using components of the old URL

More recently, some additional tricks might be recommended, as well:

4. see if the document is stored in Google's cache
5. check for the document in the Internet Archive
6. conduct a search on a general purpose search engine using a "lexical signature"

How practical is it to use these techniques? How well do they work in reality? We conducted a small, informal study to examine the efficacy of each of these six techniques.

Testing techniques for finding missing links

For source URLs, we went back to the April 2002 link tests on RLG DigiNews and chose an issue about two years old that had, at the time, eight dead links (404) and two endangered links (302), for a total of ten links, on six unique sites. Nine of the links were to html pages, and one to a pdf (portable document format) document.

The results are shown in Table 1. Each link is given a numerical ID# in column 1, with letters appended for those sites that had more than one bad link. Column 2 shows the http status code reported in April 2002, while column 3 shows the http status code in December 2002. All of the eight that were dead back in April remained dead eight months later. Of the two that used redirects, one now comes back as dead, while the other is still a redirect, but the site it redirects to is now dead. In effect, the two redirects have both become dead.

As best we were able to determine, nine of the ten pages are still available on live Web sites (column 4) and of those nine, seven can be found in their original domain (column 5). Two could only be located in other domains.

We tried to find the original Web page or document by each of the six methods mentioned above (columns 6 through 11). The "success rate" percentages shown in the last row are based on the assumption that each method is evaluated independently. Therefore, even though, for example, it's impossible for a site search to succeed if a site has no internal search engine, we only record that the method failed to find that particular Web page or document.

Below the table, we present a detailed analysis of each technique. Some of the techniques described are only viable if the text of the Web page is available. Obviously having a copy in any form (even a printout) makes it easier to recognize that you've found the page referenced in the original link. But some techniques require that a machine-readable copy be available. Even if that copy is part of an archive or cache that you wouldn't want to link to, it may make it possible to use that text to find another copy that is usable as a linking source.


Table 1

Trimming the tree

This time-honored technique involves chopping off successively higher branches of the directory "tree" that forms the URL. For example, consider a page located at http://www.xyz.com/level1/level2/level3/file.html. The "trimming" technique would start by removing just the file name to see if a directory of level3 appears. If the desired page or a possible path to it doesn't appear, then chop off level3 and try level2, etc.
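The sequence of addresses to try can be generated mechanically. The following Python sketch (the function name trimmed_urls is hypothetical) lists each successively shorter URL, ending with the site's home page.

```python
from urllib.parse import urlsplit, urlunsplit


def trimmed_urls(url):
    """List the URLs obtained by chopping one path component at a time off the end."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    candidates = []
    while segments:
        segments.pop()                      # drop the file name, then each directory
        path = "/" + "/".join(segments)
        if segments:
            path += "/"                     # keep the trailing slash for directories
        candidates.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return candidates
```

For the example address above, this yields the level3, level2, and level1 directories in turn, followed by http://www.xyz.com/.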

For our test set, trimming allowed us to rediscover seven of the ten original documents (the other three were no longer available on the original site). However, the process often did not go as originally envisioned. That's because most Web sites no longer allow file directories to be viewed, for security reasons. Thus, in most cases we had to return all the way to the home page and then browse the site, using knowledge of where the document had been previously stored (e.g., on a "publications" page) or which department within the organization originally produced it.

Success with this technique depends on both the logical organization of the site and the experience of the searcher. However, it's a low-tech technique that requires only a Web browser and some patience.

Site Search

Since most documents that go missing have merely moved elsewhere on a site, one would think that site search would be a very effective way to relocate lost documents. In our study, two of the ten documents were on a site lacking site search capability, but even so, site search claimed only a 50% success rate. Three pages didn't turn up because of mis-configured site search engines. Overall, our experience with this technique has been spotty. Besides mis-configurations, we've seen site search features with outdated indexes and poor quality search engines. Also, many sites these days perform site search by sending the search parameters to Google, but restricting the results by domain. Obviously such results will suffer from whatever deficiencies Google has with respect to that site, including possibly incomplete or outdated indexing.

URL/file name Search

This technique involves submitting a portion of the URL (such as a directory name or file name) to a general search engine. It is most likely to succeed for fairly distinctive names. We were able to rediscover half of the missing pages this way, using Google as the search engine (though to maximize the odds of finding a lost page, multiple search engines should be tried). For each of the successful searches, the target page was returned as the first hit, even though in some cases, the number of hits was very large (over 200,000 in one case). The other cases failed because in moving to new locations, the file names were changed.

Obviously, it is possible to search for a lost page using identifiers other than the directory name or file name. If a copy of the document is available, the title, keywords, and other terms can be used. We will discuss this technique further under lexical signatures, below.

Google's cache

In recent years, the Google search engine's cache of Web pages has become popular as a means to find copies of missing items, particularly material that has been taken down because it is controversial, embarrassing, or incriminating. The Google cache is most often used following a standard Google search, when the main link to a returned page is inaccessible. However, it is possible to search Google's cache directly. All that is required is to plug the desired URL into a specially formatted Google address. For example, to find the cached version of the October 2002 issue of RLG DigiNews, one would go to http://www.google.com/search?q=cache:www.rlg.org/preserv/diginews/diginews6-5.html.
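For anyone checking a batch of dead links, the cache address can be assembled by a short script. This Python sketch follows the address format quoted above; the function name google_cache_url is hypothetical.

```python
def google_cache_url(url):
    """Build a Google cache lookup address in the format quoted in the text."""
    # The example in the text omits the scheme, so strip it if present.
    for prefix in ("http://", "https://"):
        if url.startswith(prefix):
            url = url[len(prefix):]
            break
    return "http://www.google.com/search?q=cache:" + url
```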

In our study, the Google cache came up empty. None of our missing pages could be found there. This may well be due to the fact that most of the pages we looked for have been gone for at least 8 months. It is not clear how long a typical Google cache file is kept once the original page it represents is no longer available, but Google's cache may be primarily of value for pages that have gone missing fairly recently. Therefore, it might prove most valuable in the maintenance of sites that are being monitored regularly and where the attempt to find the replacement page occurs shortly after the original disappears.

Internet Archive Wayback Machine

The Wayback Machine became generally available in October 2001 and currently includes archived Web pages starting in 1996 and running through the beginning of 2002. Though huge, the Internet Archive (IA), on which the Wayback Machine operates, is not comprehensive (see RLG DigiNews, June 2002, for details on various aspects of the IA). In particular, the IA is always at least 6 months behind in providing access to archived material. Thus, it makes a good complement to Google's cache for finding machine-readable copies of missing Web pages.
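A Wayback Machine lookup can be scripted in the same way as a cache lookup. The Python sketch below assumes the web.archive.org/web/*/ address pattern, which lists the captures held for a given URL; that pattern and the function name wayback_listing_url are not taken from the study described here.

```python
def wayback_listing_url(url):
    """Build the address of the Wayback Machine page listing all captures of url."""
    # Assumption: the "*" wildcard in place of a timestamp requests the capture list.
    return "http://web.archive.org/web/*/" + url
```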

We were able to find all ten of our test pages in the IA. Since the IA actually encourages sites to link directly to the archived copies in its database, one could conceivably end a search for a missing Web page there. However, such a choice comes with significant drawbacks.

Many IA pages are incomplete, particularly missing image content. Also, of course, IA pages are frozen in time. If the content you're pointing to is still being updated, a live link to the current location is obviously preferable. Thus, linking to the IA should be reserved for material of historical interest that is available nowhere else.

However, finding the documents in the IA has considerable additional value. For example, it drives home how important it is to do timely updates of links that redirect to other sites. Of the ten pages in our test set, three showed evidence in the IA that they at one time provided redirects to the new location. Added to the redirect (302) that went dead between April and December 2002, fully 40% of the now dead links could have been avoided just by timely maintenance of redirections.

If a copy or printout of the page being linked to isn't otherwise available, the pages found in the IA provide text that can be used to more effectively search for a live copy by other means. Titles and keywords can be plugged into a general search engine. We also found that IA URLs can be plugged directly into lexical signature software, the final technique we used to locate the missing pages.

Lexical Signatures

Lexical signatures are a component of so-called Robust Hyperlinks, a Web link integrity concept developed by Thomas Phelps and Robert Wilensky at UC Berkeley. The notion of robust hyperlinks is quite simple. When linking to a Web page, the linking institution creates a "lexical signature" as part of the URL. Unlike arbitrarily assigned unique IDs, which rely on third party agencies and the voluntary participation of the content's creator, lexical signatures require only that the party linking to a Web page has access to the content.

The lexical signature is generated by determining how frequently the terms used in a particular Web page appear on the Web overall. Terms that occur infrequently, relative to the Internet as a whole, are preferred, especially if those terms appear frequently within the page of interest. Lexical signatures can consist of any number of terms, though five appears to be a good compromise between query effectiveness and search effort. A lexical signature may not be unique, but it may be distinct enough to help find the Web page again, should it go missing from its original location. Here's an example of a five-term lexical signature for the RLG home page: http://www.rlg.org?lexical-signature=encoded+archival+permissions+eureka+becoming.
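The selection principle can be illustrated with a simplified term-weighting scheme: score each term by how often it appears in the page, multiplied by the logarithm of how rare it is on the Web. The Python sketch below is an assumption-laden illustration of that idea, not the algorithm used by the Berkeley software. The function names are hypothetical, and the table of Web-wide document frequencies (web_doc_freq) would have to come from a search engine or a large sample of pages.

```python
import math
import re
from collections import Counter


def lexical_signature(page_text, web_doc_freq, total_docs, n_terms=5):
    """Pick the n_terms words that are frequent in the page but rare on the Web."""
    counts = Counter(re.findall(r"[a-z]+", page_text.lower()))

    def score(term):
        # Terms missing from the table are treated as appearing in one document.
        doc_freq = web_doc_freq.get(term, 1)
        return counts[term] * math.log(total_docs / doc_freq)

    # Highest score first; ties broken alphabetically so the result is repeatable.
    ranked = sorted(counts, key=lambda term: (-score(term), term))
    return ranked[:n_terms]


def robust_url(url, signature):
    """Append a signature to a URL in the form shown in the example above."""
    return url + "?lexical-signature=" + "+".join(signature)
```

With a toy frequency table in which "the" appears in every document and "archival" and "eureka" appear in very few, the common word scores zero and drops to the bottom of the ranking, which is the behavior the technique relies on.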

The "big idea" behind robust hyperlinks is that with some additional functionality in Web browsers, broken links could, to some degree, be made self-repairing. A link that returned a 404 error would be automatically searched on one or more search engines using the already attached lexical signature.
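The self-repair logic amounts to a simple fallback, sketched here in Python. Everything in it is hypothetical: get_status and search stand in for whatever status check and search engine interface a browser or link checker would supply, and taking the first hit is only one possible policy.

```python
def resolve_link(url, signature, get_status, search):
    """Return url if it still works; otherwise the top search hit for its signature.

    get_status(url) -> http status code; search(query) -> list of result URLs.
    Both are supplied by the caller (assumed interfaces, for illustration only).
    """
    if get_status(url) != 404:
        return url
    hits = search(" ".join(signature))
    return hits[0] if hits else None
```

Passing the two functions in as parameters keeps the sketch independent of any particular search engine and makes the fallback easy to try out with canned results.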

Beta level open source software for generating lexical signatures is available from UC Berkeley. It requires Java 2 v.1.3, which is available for Windows, Linux, and various flavors of Unix. We tested it on a PC running Windows 98. If you're curious about this software, be forewarned that it isn't quite "click and go" and requires a bit of tinkering to get working. The software is not graphical (it runs on the command line) and documentation is minimal.

In our small study, lexical signatures helped locate five of the ten missing pages, including a live version that didn't turn up by any other means. Of the five failures, one was a consequence of the document no longer being available, and two resulted from crashes of the software (did we mention that it's not quite ready for prime time?). The other two failures are more interesting, and point more clearly to the weaknesses of this approach. In one case, the Web page (a home page containing introductory text) changed considerably and most of the original lexical signature terms no longer appeared. In the last case, the page had moved to another domain, but was largely unchanged. However, one of the five terms chosen by the lexical signature software no longer appeared on the page and that's all it took to doom the effort.

In the cases where the lexical signature software crashed, we decided to try making up our own "lexical signatures" by applying some plain common sense about the English language. In both cases we were able to craft five-word searches that brought up the correct new location as the first hit in a Google search. This is why having access to the text of the lost Web page helps: you can assemble search terms based on the title, keywords, or other terms or phrases that will increase the precision of your search. Lexical signatures hold promise because of their potential to automate this otherwise tedious and time-consuming process.

Conclusion/recommendations

Broken links will continue to be a factor in Web site integrity for the foreseeable future. If you link to external sites and value the resources they represent, the burden to maintain the links is on you.

Scan external links regularly. Pages that return redirects should be updated as soon as possible. Be aware that not all redirects are easily detected and that monitoring that goes beyond looking at http status codes may be necessary to find some kinds, as well as to ferret out dead links that masquerade as good.

No single technique for rediscovering lost Web content is effective in all cases. Capturing a lexical signature at the time an external resource is linked to, whether one generated by software or handcrafted, can help relocate the resource if it moves without warning. If not done ahead of time, cached or archived copies can be used for the same purpose. In a pinch, even simple site browsing can be surprisingly effective in locating lost resources.

[1] Wallace Koehler, "Web Page Change and Persistence—A Four Year Longitudinal Study", Journal of the American Society for Information Science and Technology, v.53, no 2, (January 15, 2002), pp.162-171. [back]

--Richard Entlich

 


Calendar of Events

Sixth International Open Forum on Metadata Registries
January 20-24, 2003
Santa Fe, NM

The Open Forum for 2003 will present standards, tutorials, and practical experience about the following technologies:
  • Universal Description, Discovery, and Integration (UDDI)
  • OASIS/ebXML Registries
  • ISO/IEC 11179 Metadata Registries
  • Database Catalogs (e.g., relational DBMS/SQL)
  • Software Development Repositories
  • Software Component Registries
  • Terminology and Ontological Registries, and
  • Dublin Core Registries
The theme will be cooperation and interoperation of these technologies, and the conference will highlight the management of metadata and data semantics for Web services, XML data exchanges, data management, and other applications.

Web-Wise Conference on Libraries and Museums in the Digital World
February 26-28, 2003, Washington, DC

Sponsored by the Institute of Museum and Library Services and Johns Hopkins University, this year's theme is "Sustaining Digital Resources." Further information about the conference will be provided as plans are finalized. For further information contact: Laura Varricchione.

Practical Experiences in Digital Preservation
April 2-4, 2003
Kew, United Kingdom
The International Council of Archives Committee on Information Technology in conjunction with the Public Record Office, UK is sponsoring this conference that will provide an opportunity for archivists and information managers to discuss implementation issues and share experiences. Practical approaches to digital preservation will be demonstrated and will contribute to a growing pool of best practice. To register contact: digital-archive@pro.gov.uk.



Announcements

New Publications on Digital Preservation
The Digital Preservation Testbed (Testbed Digitale Bewaring) has recently posted a number of new publications about digital preservation on its Web site. They include XML and Digital Preservation, and XML Implementation Options for Emails. For more information on the Digital Preservation Testbed, see the June 2002 issue of RLG DigiNews.

NINCH Guide to Good Practice
The National Initiative for a Networked Cultural Heritage is pleased to announce the release of the NINCH Guide to Good Practice in the Digital Representation and Management of Cultural Heritage Materials. The guide was created by practitioners working in different disciplines and media in museums, libraries, archives, the arts, and academic departments.

International Standard Being Developed for Archiving PDFs
A new joint activity has been initiated between the Association for Suppliers of Printing, Publishing and Converting Technologies (NPES), and the Association for Information and Image Management, International (AIIM International) to develop an international standard that defines the use of PDF for archiving and preserving documents.

The Public Record Office (PRO): Making the Nation’s Memory Available Online
The Public Record Office (PRO) has announced that it is seeking Licensed Internet Associates as part of its vision to provide electronic access, over the Internet, to digitized images of the documents it holds.



RLG News

Joint RLG-JISC Symposium: Selection and Collaboration in Digital Preservation
24-25 March 2003
Washington, DC USA

This joint symposium, sponsored by RLG and the UK Joint Information Systems Committee (JISC) and hosted by the Library of Congress, will address the critical issues of selection and collaboration in preserving digital materials.

Digital preservation is an international issue with many transferable lessons. Leading speakers from the USA and Europe will describe their experiences and future plans. Through presentations, discussions and breakout groups, participants will have opportunities to contrast different approaches, consider which approaches will be relevant for their own institution and interests, and to further explore opportunities for collaboration in digital preservation across organizational and national boundaries. This event is the latest in a series of collaborations between JISC and RLG begun in 1996 and resulting in conferences, research projects and publications.

Attendance at the conference itself will be free to staff of RLG member institutions and UK HE/FE institutions, but numbers are limited and early booking is advised. We anticipate a small number of places will also be made available to others to attend on a cost recovery basis (non-members of UK HE/FE institutions or of RLG should register their interest in attending on the conference web page).

Accommodation, travel and meals (other than conference lunches and refreshments) will be the responsibility of participants. A number of hotel rooms will be available for booking by conference participants at a discounted rate.

The agenda, further details, and registration are available online at http://www.rlg.org/events/rlgjisc2003-agenda.html

 


Publishing Information

RLG DigiNews (ISSN 1093-5371) is a newsletter conceived by the members of the Research Libraries Group's PRESERV community. Funded in part by the Council on Library and Information Resources (CLIR) 1998-2000, it is available internationally via the RLG PRESERV Web site. It will be published six times in 2002. Materials contained in RLG DigiNews are subject to copyright and other proprietary rights. Permission is hereby given for the material in RLG DigiNews to be used for research purposes or private study. RLG asks that you observe the following conditions: Please cite the individual author and RLG DigiNews (please cite URL of the article) when using the material; please contact Jennifer Hartzell, RLG Corporate Communications, when citing RLG DigiNews.

Any use other than for research or private study of these materials requires prior written authorization from RLG, Inc. and/or the author of the article.

RLG DigiNews is produced for the Research Libraries Group, Inc. (RLG) by the staff of the Department of Preservation and Conservation, Cornell University Library. Co-Editors, Anne R. Kenney and Nancy Y. McGovern; Production Editors, Martha Crowe and Barbara Berger Eden; Associate Editor, Robin Dale (RLG); Technical Researchers, Richard Entlich and Peter Botticelli; Technical Coordinator, Carla DeMello; Technical Assistant, Kimberly Gazzo.

All links in this issue were confirmed accurate as of December 11, 2002.

Please send your comments and questions to RLG DigiNews Editorial Staff.

   
 