December 15, 2002, Volume 6, Number 6
ISSN 1093-5371
Editors' Interview
Feature Article 1
Highlighted Web Site
FAQ
Editors' Interview
JPEG 2000
Dr. Daniel Lee
Editors’ Note
Could you briefly describe JPEG 2000?
Why should our readers pay attention to JPEG 2000?
Who have the major players been in the development of JPEG 2000? What standards bodies are involved?
How are JPEG 2000 software or vendor products certified as JPEG 2000 compliant?
How well suited is JPEG 2000 for the range of cultural objects (e.g., texts, manuscripts, photographs, art objects, etc.) being digitized by libraries, archives, and museums today? Is it, for example, a good choice for text and other image data that are characterized by edge detail? Is it a good alternative to Group 4 for 1-bit data?
JPEG 2000 offers many features not provided by the current JPEG standard, including support for multiple resolutions, tiling, and region-of-interest coding. Can you describe the advantages of these features and what kinds of cultural heritage images might benefit most from their use?
How good is JPEG 2000's compression efficiency, and how well does it retain image quality with increasing compression? How do these features compare with JPEG and other common image file formats? Does JPEG 2000 offer both lossless and lossy compression modes? Does its compression work well with various bit-depth files?
JPEG 2000 is still a work in progress. Some parts are incomplete, and those sections that have been completed have been subject to several technical corrections and amendments. Is JPEG 2000 ready for still-image users, who require stability and contemplate a long-term commitment to their images?
A growing number of cultural heritage institutions have committed to the long-term maintenance of large image collections. Before migrating to another format, they will want assurance that the new format offers functionality compelling enough to merit the effort and expense of switching. A number of file formats (e.g., MrSID, DjVu, LuraWave) offering wavelet compression and some of the same features as JPEG 2000 have appeared in recent years. Is there anything about JPEG 2000 that distinguishes it from its competitors and deserves special consideration as a replacement for these?
Many cultural heritage institutions store their master image files in TIFF format, while using JPEGs to deliver the same images over the Web. Does JPEG 2000 have the potential to offer all these features in a single format and obviate the need to maintain two sets of files for each image? Should cultural heritage institutions consider using JPEG 2000 as a single preservation and delivery format?
What kinds of tools are available right now for the creation, manipulation, and delivery of JPEG 2000 images?
At the Image Compression Symposium, Steve Kerr and Bernie Brower [see resources listed at end] suggested that: “JPEG 2000 is more than a change in compression. It is a paradigm shift in how we collect, store, transmit, and use image information.” How would you respond to that statement?
Could you describe ISO's overarching strategy for using scaleable approaches for archiving and delivering multimedia? How does JPEG 2000 fit into that strategy?
JPEG 2000 offers numerous features not available in JPEG. Yet experience has shown that technological superiority is not necessarily sufficient when it comes to acceptance of new formats. For example, PNG has numerous advantages over GIF, yet it has failed to win large numbers of converts. Other potential formats, such as Flashpix, simply failed to catch on. What is the JPEG 2000 committee doing to ensure that JPEG 2000 will be more quickly and fully embraced by users and toolmakers?
One preservation requirement is to preserve the integrity of the original file (object) over time. A number of institutions have invested in creating "libraries" of TIFF RGB masters. What would happen in a TIFF RGB to JPEG 2000 to TIFF RGB transformation? Would the output match the input? If not, what would potentially be lost?
At the moment, decoding a J2K, JP2, or JPX file with a Web browser requires a plug-in. Widespread use and acceptance of JPEG 2000 will probably require high-quality native support in the mainstream browsers. PNG has had native browser support for over five years, yet some of its best features are still not properly supported in popular browsers. Has Microsoft or AOL (Netscape) committed to providing full native support for any of the JPEG 2000 file formats? If so, when do you expect it to appear?
The official JPEG 2000 Web site indicates that the core coding system is not patent free, though it is intended to be royalty and license-fee free. Should potential users be concerned that future assertion of patent rights may cause problems, such as happened with GIF images?
TIFF is widely used by cultural heritage institutions for master image files. The main TIFF specification hasn't been updated in over a decade. This may be a blessing, as regular updates often introduce "feature creep" with resulting incompatibilities. On the other hand, a dead image format will eventually suffer obsolescence. Should users be resigned to the need for periodic migration to new formats? What are the signs that an image format is reaching the end of its life expectancy?
Suggested Resources
"Library Potential Impacts," by Steve Kerr and Bernie Brower, Image Compression Symposium.
"ISO JPEG 2000 Standards Efforts," by Gordon Ferrari, Image Compression Symposium.
"NITFS JPEG 2000 Implementation Schedule/Events," Bandwidth Compression Symposium.
"The Next Generation of Compression JPEG 2000," by Bernie Brower, Image Compression Symposium.
From Oral Tradition to Digital Collectives: Information Access and Technology in Contemporary Native American Culture[1]
Kari R. Smith
For people who may live both physically and culturally distant from the majority culture in their immediate environment, information technology can provide a boost toward accessing and documenting their own heritage. As early adopters of the Web, Native Americans began using the Internet for e-commerce and cultural outreach in the early 1990s. The University of Michigan School of Information (SI), through internships and workshop classes held since 1997, has been exploring ways that digital technology can facilitate appropriate access and greater participation in cultural heritage documentation and presentation in tribal colleges and communities across the United States. The Cultural Heritage Preservation Institute (CHPI) and its research component, the Digital Collective, were developed by SI Professor Maurita Peterson Holland and the author working with Native American community leaders, educators, cultural experts, and SI graduate students. These efforts culminated in 2001 with an international meeting in Hilo, Hawaii, of indigenous culture and technology specialists; library, museum, and archives professionals; funders; and digital library researchers. At this three-day meeting convened by SI, issues were discussed involving the use of information technology in preserving, documenting, and participating in indigenous cultures.
The Cultural Heritage Preservation Institute
In 1997 a middle-school teacher on the Navajo Nation asked Holland to consider ways that SI could collaborate with K-12 schools and the tribal college to use information technology to enhance cultural education in the classroom. Holland was a key contact at SI, both as the director of the Academic Outreach Programs and as the faculty member instrumental in coordinating the internships on the Navajo Nation.
The author, who had recently returned from a sixteen-week internship at the Navajo Nation with an interest in developing culturally appropriate uses for information technology, worked with Holland as a graduate research assistant. By 1998 we had developed a plan and program for CHPI, a week-long technology and culture workshop for middle-school students and their teachers. Based on the teacher’s needs and our experiences on the Nation, there were several goals we wanted to achieve, among them:
The Institute’s goals were to encourage effective student participation in the information society by providing equipment and technology skills, stimulate interest in a career in information science, encourage use of the Web as a community space, raise the awareness of the tribal college, and encourage the pursuit of higher education by K-12 students. SI acted as institute organizer, technology trainer, and facilitator, with Holland as the project director and the author as the project manager. Using the SI workshop framework, [2] graduate students created technology instructional materials, led discussions on how the projects could fit into classroom teaching, and worked one-on-one with the participants in the creation of the projects. The challenges for SI, as identified in its Final Report and Evaluation, included incorporating information technologies into educational modules for use in a middle-school classroom, bridging the perceived gap between traditional culture and modern technological life (use of information technology to teach about Diné culture), and creating and producing an institute of high cultural integrity and significance for all participants.
The June 1998 institute held in Tsaile, Arizona, at Diné College on the Navajo Nation, was a great success. Twenty-two elementary and middle-school students and their teachers attended the week-long technology and culture workshop. They created educational Web projects and learned technology skills they could share once home. Each student became familiar with Diné College and learned not only about computers, the Internet, and creating Web pages, but also about Diné culture and history. The SI graduate students who taught the technology components gained valuable lessons in working in a challenging IT environment. They also learned about a new nation, culture, and language. Uniquely, the workshop was successful because of the combined approach of applying cultural education to technology skills. Several of the adult participants commented that although they had previously taken workshops on creating Web pages, CHPI was the most successful because there was a purpose to learning the technology skills. Both the teachers and their students received instruction on Internet basics that included browsing, searching, and critical evaluation of Web sites. They also learned how to make basic Web pages, use a Kodak DC210 digital camera, [3] and scan and edit digital images. They developed skills to use the Internet and other digital technology tools to share their heritage with others. By the end of the institute each group of participants created a Web site project based on Diné culture using the information and skills they had acquired.
The participants toured the Ned A. Hatathli Museum on the college campus, which describes and displays collections of Navajo and other Native American artifacts from a Navajo point of view. In addition, they were able to experience Diné cultural heritage throughout the week from demonstrations and lectures by Diné artisans (woodcarving, pottery, basketry, and silversmithing). A guided tour of nearby Canyon de Chelly National Monument gave them a chance to learn about the historical and cultural significance of the canyon and enjoy its natural beauty. One night a local astronomer set up a telescope so the participants could see a special astronomical event and explained the stars and constellations in the Navajo sky. During the final two days of the institute, participants designed and created Web-based projects that were to be the basis for ongoing education and curriculum development of cultural heritage education and community heritage documentation. For example, as part of their project on native plants, students from Kayenta Middle School drew pictures and took digital photographs of actual plants, then discussed native uses and stories of the plants on the Web pages they created.
On the last day of the institute, each participant took part in a public presentation of individual projects in the Diné College Museum attended by the president of Diné College, faculty, and guests.
CHPI in Michigan’s Upper Peninsula
The second CHPI was held in 1999 in Ann Arbor and in the Upper Peninsula of Michigan. Incorporating feedback and lessons learned from 1998, the institute was presented as two workshops. The first focused on digital technology and the Internet and was held at SI. The second, held in the Upper Peninsula in late June, focused on learning about and documenting Ojibwa cultural heritage. The participants, in the eighth through twelfth grades, were from the Upper Peninsula's Sault Ste. Marie Tribe of Chippewa Indians and the Bay Mills Indian Community.
The CHPI participants in 1999 were Internet and Web savvy. The concerns and issues they addressed in their online projects were more about content than process. They were concerned with making sure information available on the Web was culturally sensitive, accurate, and from the point of view of the community and that no sacred, secret, or sensitive information was presented. These intentions were specifically manifested in Web pages created about an ancient tribal burial site visited during the institute field trip. Each person was careful to make sure there were no identifiable landmarks in pictures they took and that there was no information about how to get to the location on their Web pages. Those precautions were very important because ancient burial sites are often targeted by artifact hunters who might use the Web to find such sites. Participants also wanted to use their Ojibwa language and were interested in incorporating audio files into the Web site in the future. Their projects were all based on the theme of “Geography, History, and Culture of Michigan’s Upper Peninsula.”
The CHPI Becomes a Model
After the second institute we wanted to go beyond a project-based approach to documenting culture and adapt our experience gained through CHPI to a more-extensible model for digital access and preservation of culture-based knowledge. As an archivist, the author wanted to build into the CHPI a process to capture, describe, and preserve the digital images, drawings, texts, and new knowledge created during the institutes.
Some of the questions we addressed in our research were:
What we developed was a process and model called the Digital Collective: a model for storing and accessing shared information and knowledge, as well as for creating new knowledge and recreating global memory, and a place where people share personal and professional information and where they seek connections and build community. The Digital Collective’s power is in the people’s sharing of perspective, recollections, augmentations, and facts in the language of their culture. [4]
In a presentation at the New Information Technologies Conference about digital libraries in China in 2001, we developed five basic principles [5] of the Digital Collective model.
Principle One. The digital library must be inclusive of all formats and digital instantiations. In so doing, it must adjust its definition to attend to new forms of information that exist only in virtual space.
Principle Two. Institutions of memory must work together to carry out their responsibility.
Principle Three. Research about learning will reshape how people use information and create new knowledge.
Principle Four. The Web offers two-way communication and publication. Every user of information is also a potential producer. Therefore, stakeholders can become collaborators in describing and sharing artifacts and experience.
Principle Five. Employing principles one through four, we can define a new digital library model: the Digital Collective.
The Digital Collective is not a digital library, nor a virtual museum, nor an electronic archive. It is a complex system for storing, describing, accessing, and using digitized multiformat materials. Unlike most existing digital libraries or virtual museums that are created and populated by experts from institutions, the Digital Collective is a community space where nonexperts and ordinary people can enter their digital objects along with their information, stories, and experiences about their own or other objects in the collective database.
The Digital Collective is special because it uses experts to catalog, describe, organize, and produce products from the multimedia objects in a Digital Collective, as well as nonexperts who have personal knowledge and objects and are interested in sharing these, in drawing connections among other objects, and in interacting with others regardless of time or space. Another distinctive benefit of the Digital Collective is its inherent focus on digital objects and therefore born-digital objects. Most institutions of memory that are creating databases for digital access to their collections are digitizing physical objects for greater or remote access. The Digital Collective model includes born-digital objects created by artists, writers, and musicians, as well as physical objects that have been digitized.
Another important aspect of the Digital Collective is its attempt to collect current objects and information, rather than wait until materials reach an archive, library, or museum. By allowing people to self-select materials to contribute to a collective and add their own knowledge about materials already in it, the Collective becomes a record of society itself in addition to a rich repository of objects and descriptive information.
We recognized a need for global discussion about initiatives to document and preserve indigenous culture and extend the conversation about the Digital Collective model. Working with Professor Daniel Atkins at SI, and with funding from the National Science Foundation and further support from the W. K. Kellogg Foundation, in 2001 SI convened a meeting, facilitated by The Grove Consultants, to share ideas and explore connections about the role of information technology in celebrating and extending culture. [6]
Invited attendees of the two-day meeting in Hilo, Hawaii, included Native North Americans (including Alaskan, Hawaiian, and Canadian); Maori, Australian Aboriginals, Sami, and Brazilian and African participants; relevant cultural institutions; academic experts in research and technology; and potential funders. All attendees shared an interest in cultural preservation, education, and sharing through information technology.
Each of the thirty-two attendees was asked five key questions before arriving at the meeting:
The discussions at the meeting centered on the topic “The Use of the Web in Indigenous Communities.” Based on initial discussion, three threads of interest were identified: preservation, technology, and networking. Participants illustrated their vision of the effect that the application of information technology could have on indigenous communities. Following that, a discussion of themes and threads drew together the critical points.
On the second day, the participants shared “what I can do” and “what I have to offer.” This led to defining major topics for discussions in working groups: 1) developing a guide for indigenous communities that want to use the Web and 2) building a global indigenous library.[7] Several successful collaborations have taken place since the meeting. In addition, there have been discussions at funding agencies about developing new approaches to support indigenous projects. Work is continuing at SI and other institutions in applying the Digital Collective model for cultural preservation and access. The experience reported in this paper began for the author in summer 1997 with a sixteen-week SI-sponsored internship on the Navajo Nation at Navajo Community College—now Diné College—in Tsaile, Arizona. [8]
In addition to working on various information management and archives projects at the college and with the Navajo Nation Library System, in Window Rock, Arizona, there was ample opportunity to travel around the twenty-six thousand square mile reservation to learn about and visit other cultural collections and museums and archives. A few of the places that impressed me in terms of their rich collections and potential were the Tuba City Museum, the Window Rock Tribal Museum and Library, and the Burger King in Kayenta with its WWII Code Talkers exhibit of original artifacts. This exhibit is perhaps the richest private collection on permanent display of WWII Navajo Code Talkers. The fact that it’s located in a fast-food restaurant means that a fair number of people visiting Monument Valley see it, but only people who visit Kayenta have this opportunity. In addition, there were various exhibitions and collections in trading posts across the Nation. Through conversations with people in many towns and trading posts, I was introduced to many rich depositories of information and cultural resources, as well as varying levels of technology used in the storage, organization, and dissemination of those resources. By the time I left the Nation, it was apparent that a directory of some sort would be useful for locating cultural collections across the Navajo Nation.
I was fortunate that my internship was coordinated by Professor Holland. Other SI projects that the author and Holland worked on in collaboration with tribal colleges in the U.S. include:
"Alternative Spring Break" library and archival graduate students at Native American Educational Services College in Chicago. Fourteen students spent a week processing archival materials and documenting and advising on information technology for the college campus.
School of Information and the American Indian Higher Education Consortium. Internet public library technology customized for and by each tribal college.
Crow Nation, Montana: Little Big Horn College. "Alternative Spring Break" and archival internships at the Tribal College Library and Archives.
Lessons Learned
During the four years that I worked with Native and indigenous communities, libraries, archives, and tribal colleges, I learned several valuable lessons.
Future Challenges for the Profession
Indigenous people must be included in developing standards and technologies at national and international levels.
Education and training issues, such as funding, the ability to enter established programs, and time commitments, need to be addressed.
Increased access to indigenous knowledge has to be weighed carefully against the potential harvesting of that knowledge for commercial use.
Archives, libraries, and museums need to understand what they have in their collections, what ownership rights exist, and how and when the Native American Graves Protection and Repatriation Act (NAGPRA) may apply.
[1] This paper is modified from a presentation for the panel "Creating Web Access to the Cultural Record" at the Society of American Archivists Annual Conference in Birmingham, Alabama, in August 2002. Special thanks are due to Professor Maurita P. Holland, teacher, mentor, and colleague, for all her support. Much of this work was supported by funding from the W. K. Kellogg Foundation, the Microsoft Foundation, and the National Science Foundation, in addition to the School of Information at the University of Michigan in Ann Arbor, Michigan.
[2] The W. K. Kellogg Foundation granted SI funds to re-invent the education of library schools for the twenty-first century. Part of this effort is the Practical Engagement Program (PEP), an integral part of SI's professional master's program. Designed to integrate the application of knowledge and skills to specific problems outside the classroom, PEP both enables and requires students to combine what they have learned in the classroom with what they observe and experience in the "real world."
[3] Digital cameras and software for each school were provided through a grant from the W. K. Kellogg Foundation.
[4] Holland, Maurita P., and Kari R. Smith, "Using Information Technology to Preserve and Sustain Cultural Heritage: The Digital Collective," in the 2000 UNESCO World Culture Report (Paris: UNESCO Publishing, 2000).
[5] Holland, Maurita P., and Kari R. Smith, "What the Digital Library Doesn’t Tell You: Exploring the Gaps and Opportunities," p. 89 in Global Digital Library Development in the New Millennium, ed. by Ching-chih Chen (Beijing: Tsinghua University Press, 2001).
[6] The meeting was cosponsored by the American Indian Higher Education Consortium and the Smithsonian Institution Museum of the American Indian.
[7] Digital Collectives in Indigenous Cultures and Communities, report of the meeting held August 10 and 11, 2001, in Hilo, Hawaii. Copies of the report may be requested from: School of Information, Office of Academic Outreach, University of Michigan, 304 West Hall, 550 East University Avenue, Ann Arbor, Michigan 48109-1092.
[8] Diné College was established in 1968 as the first tribally controlled community college in the United States. In creating an institution of higher education, the Navajo Nation sought to encourage Navajo youth to become contributing members of the Navajo Nation and world society.
Highlighted Web Site
Can you recommend any techniques for reducing the incidence of broken external links on my Web site, and for rediscovering resources that have moved?
There is conventional wisdom for both parts of this question. In our response we examine some of the subtleties involved in broken link detection and then demonstrate how well some of the oft-recommended strategies for locating lost Web content really work.
Web site integrity and preservation are increasing concerns in the library and archival communities, and the subject of several research and development efforts. The answer to this question is informed by work on Cornell's Project Prism, an NSF-funded DLI2 initiative, including investigations of integrity measures for external resources, the testing of tools for Web site monitoring, and the development of techniques for locating lost Web pages by utilizing lexical signatures and document similarity software.
Link rot, in which a significant percentage of links no longer work, is a common problem on the Web. It has also been noted that on any site where links are not maintained or updated on a regular basis, the percentage of outdated links rises rapidly over time. For example, in a link analysis of RLG DigiNews conducted for the fifth anniversary issue, the percentage of bad links rose from approximately 10% after one year to nearly 40% after five years.
The causes of link rot are well known. As sites grow and change, content is often reorganized or moved. In addition, domain names come and go, sometimes disabling all the existing links to a site simultaneously. A variety of techniques, including HTTP redirects, HTML redirects ("meta refreshes"), and scripting languages such as JavaScript and PHP can be used to transport users relying on obsolete URLs to a new location, with or without prior warning that the content has moved. Other techniques are available to help stabilize Web content.
Persistent addressing schemes such as PURL servers, DOIs (Digital Object Identifiers) and URNs (Uniform Resource Names) allow content to move around on local servers without affecting the functionality of existing links. Unfortunately, none of these techniques provides a complete or permanent solution to the link rot problem. Redirects are fairly widely used, but are generally kept active for limited periods of time. Once the redirect is removed, all links that haven't been updated will fail. None of the permanent addressing schemes is widely used yet. In either case, the external links on a site are at the mercy of those who maintain the external sites. Grander solutions have been contemplated. A British startup known as LinkGuard attempted to solve the link rot problem by creating and regularly updating a massive (40 Terabyte) map of all known Web links. The idea was to sell a service that would reference the link map to correct broken links as they were encountered. Unfortunately, the appetite of Web site maintainers to pay for the elimination of broken links proved less voracious than anticipated, and LinkGuard shut down in November 2001, leaving as one of its legacies several hundred bad links to its now defunct linkguard.com domain name. Until another "big idea" comes along to solve the link rot problem once and for all, more modest techniques must be relied upon. The process requires two basic steps. First one must be able to detect when a link has gone bad, or even better, when it's about to go bad. Second, once a link is confirmed bad, one needs a mechanism to find a viable link with which to replace it. Ideally, a software tool would automate both steps and require minimal human intervention. Realistically, available tools for automated handling of the second step are still in their infancy and no one technique is suitable for all situations. 
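The first of the two steps, detection, lends itself to automation. What follows is a minimal sketch in Python, not a description of any particular link checker. It assumes the status code and page text have already been fetched by some other tool. The function name classify_link, the category labels, and the keyword list are all hypothetical choices made for illustration.

```python
from typing import Optional

# Phrases that may signal a moved page. This list is illustrative, not exhaustive.
MOVED_HINTS = ("moved", "forwarded", "new location")


def classify_link(status: Optional[int], body: str = "") -> str:
    """Sort one link-check result into a coarse category.

    status is the HTTP status code, or None if the server could not be reached.
    body is the text of the returned page, if any.
    """
    if status is None:
        return "unreachable"  # temporary outage, or the site or domain is gone
    if status in (301, 302):
        return "endangered"   # still works, but only through a redirect
    if status == 404:
        return "dead"         # page not found
    if status == 200:
        text = body.lower()
        if any(hint in text for hint in MOVED_HINTS):
            return "suspect"  # healthy status code, but the text hints at a move
        return "ok"
    return "other"            # anything else needs a human look


if __name__ == "__main__":
    # Hypothetical results, as they might come back from a link-checking run.
    samples = [
        ("http://www.example.org/a.html", 200, "Welcome to our collection."),
        ("http://www.example.org/b.html", 302, ""),
        ("http://www.example.org/c.html", 404, ""),
        ("http://www.example.org/d.html", 200, "This page has moved."),
        ("http://www.example.net/", None, ""),
    ]
    for url, status, body in samples:
        print(url, classify_link(status, body))
```

The keyword test is deliberately crude (it would also match a word such as "removed"), so anything it flags as suspect should be reviewed by a person rather than treated as confirmed.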
Step one: Identifying failed and endangered links
Bad external links can be detected using one of the many available link checker software applications or Web services. Typically the product is pointed at a Web page on your site and directed to check all external links. The resulting report will indicate the status of each link, including 404 errors (page not found), or those that required redirection, e.g., HTTP status codes 301 (moved permanently) and 302 (moved temporarily). Sites that can't be reached at all won't return a status code, and may indicate either a temporary service outage or a more permanent loss of an entire site or domain.
However, there are typically more failed or endangered links than a simple link check reveals. Some custom 404 messages improperly report an HTTP status code of 200 (indicating a good page), a phenomenon termed "phantom URLs" by Wallace Koehler. [1] Also, redirects that are mediated by meta refreshes or scripts may not be detected by some link checkers. Therefore, a Web monitoring package that can also detect changes in page size or report the presence of particular keywords (such as "moved," "forwarded," "new location," etc.) is desirable.
Step two: Safeguarding endangered links and replacing obsolete ones
Consider any redirect returned by an external link on your site to be an opportunity to avoid a future broken link. Most redirects exist on a temporary basis. As soon as you become aware that an external link is being redirected, substitute the new destination URL on your site. That's pretty straightforward.
But what if an object pointed to by an external link suddenly turns up missing, with no warning? A few years ago, the advice for finding the new location (if one indeed existed) would have included the following:
1. "trim" directories off the end of the URL to get a directory listing
More recently, some additional tricks might be recommended, as well:
4. see if the document is stored in Google's cache
How practical is it to use these techniques? How well do they work in reality? We conducted a small, informal study to examine the efficacy of each of these six techniques.
Testing techniques for finding missing links
For source URLs, we went back to the April 2002 link tests on RLG DigiNews and chose an issue about two years old that had, at the time, eight dead links (404) and two endangered links (302), for a total of ten links, on six unique sites. Nine of the links were to HTML pages, and one to a PDF (Portable Document Format) document.
The results are shown in Table 1. Each link is given a numerical ID# in column 1, with letters appended for those sites that had more than one bad link. Column 2 shows the HTTP status code reported in April 2002, while column 3 shows the HTTP status code in December 2002. All of the eight that were dead back in April remained dead eight months later. Of the two that used redirects, one now comes back as dead, while the other is still a redirect, but the site it redirects to is now dead. In effect, the two redirects have both become dead. As best we were able to determine, nine of the ten pages are still available on live Web sites (column 4) and of those nine, seven can be found in their original domain (column 5). Two could only be located in other domains.
We tried to find the original Web page or document by each of the six methods mentioned above (columns 6 through 11). The "success rate" percentages shown in the last row are based on the assumption that each method is evaluated independently. Therefore, even though, for example, it's impossible for a site search to succeed if a site has no internal search engine, we only record that the method failed to find that particular Web page or document. Below the table, we present a detailed analysis of each technique.
Some of the techniques described are only viable if the text of the Web page is available.
Obviously, having a copy in any form (even a printout) makes it easier to recognize that you've found the page referenced in the original link. But some techniques require that a machine-readable copy be available. Even if that copy is part of an archive or cache that you wouldn't want to link to, its text may make it possible to find another copy that is usable as a linking source.

Table 1

Trimming the tree

This time-honored technique involves chopping off successively higher branches of the directory "tree" that forms the URL. For example, consider a page located at http://www.xyz.com/level1/level2/level3/file.html. The "trimming" technique would start by removing just the file name to see if a directory listing for level3 appears. If the desired page or a possible path to it doesn't appear, then chop off level3 and try level2, etc.

For our test set, trimming allowed us to rediscover seven of the ten original documents (the other three were no longer available on the original site). However, the process often did not go as originally envisioned. That's because most Web sites no longer allow file directories to be viewed, for security reasons. Thus, in most cases we had to return all the way to the home page and then browse the site, using knowledge of where the document had been previously stored (e.g., on a "publications" page) or which department within the organization originally produced it. Success with this technique depends on both the logical organization of the site and the experience of the searcher. However, it's a low-tech technique that requires only a Web browser and some patience.

Site Search

Since most documents that go missing have merely moved elsewhere on a site, one would think that site search would be a very effective way to relocate lost documents. In our study, two of the ten documents were on a site lacking site search capability, but even so, site search claimed only a 50% success rate.
Three pages didn't turn up because of misconfigured site search engines. Overall, our experience with this technique has been spotty. Besides misconfigurations, we've seen site search features with outdated indexes and poor-quality search engines. Also, many sites these days perform site search by sending the search parameters to Google, but restricting the results by domain. Obviously such results will suffer from whatever deficiencies Google has with respect to that site, including possibly incomplete or outdated indexing.

URL/file name Search

This technique involves submitting a portion of the URL (such as a directory name or file name) to a general search engine. It is most likely to succeed for fairly distinctive names. We were able to rediscover half of the missing pages this way, using Google as the search engine (though to maximize the odds of finding a lost page, multiple search engines should be tried). For each of the successful searches, the target page was returned as the first hit, even though in some cases the number of hits was very large (over 200,000 in one case). The other searches failed because the file names were changed when the pages moved to new locations.

Obviously, it is possible to search for a lost page using identifiers other than the directory name or file name. If a copy of the document is available, the title, keywords, and other terms can be used. We will discuss this technique further under lexical signatures, below.

Google's cache

In recent years, the Google search engine's cache of Web pages has become popular as a means to find copies of missing items, particularly material that has been taken down because it is controversial, embarrassing, or incriminating. The Google cache is most often used following a standard Google search, when the main link to a returned page is inaccessible. However, it is possible to search Google's cache directly. All that is required is to plug the desired URL into a specially formatted Google address.
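As a rough illustration of that formatting step, the Python sketch below builds a cache lookup address from a page URL. It assumes the `cache:` query form used in this article; the helper name `google_cache_url` is ours, and the sketch does not check whether Google actually holds a cached copy.

```python
from urllib.parse import urlsplit

# Address prefix taken from the cache lookup form described in this article.
GOOGLE_CACHE_PREFIX = "http://www.google.com/search?q=cache:"


def google_cache_url(page_url: str) -> str:
    """Return a Google cache lookup address for page_url (hypothetical helper)."""
    parts = urlsplit(page_url)
    if parts.netloc:
        # Drop the scheme; the cache query uses host + path only.
        target = parts.netloc + parts.path
    else:
        # URL was given without a scheme, so the whole thing parsed as a path.
        target = parts.path
    if parts.query:
        # Assumption: any query string is passed through unchanged.
        target += "?" + parts.query
    return GOOGLE_CACHE_PREFIX + target
```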
For example, to find the cached version of the October 2002 issue of RLG DigiNews, one would go to http://www.google.com/search?q=cache:www.rlg.org/preserv/diginews/diginews6-5.html.

In our study, the Google cache came up empty. None of our missing pages could be found there. This may well be because most of the pages we looked for have been gone for at least eight months. It is not clear how long a typical Google cache file is kept once the original page it represents is no longer available, but Google's cache may be primarily of value for pages that have gone missing fairly recently. Therefore, it might prove most valuable in the maintenance of sites that are being monitored regularly and where the attempt to find the replacement page occurs shortly after the original disappears.

Internet Archive Wayback Machine

The Wayback Machine became generally available in October 2001 and currently includes archived Web pages starting in 1996 and running through the beginning of 2002. Though huge, the Internet Archive (IA), on which the Wayback Machine operates, is not comprehensive (see RLG DigiNews, June 2002, for details on various aspects of the IA). In particular, the IA is always at least six months behind in providing access to archived material. Thus, it makes a good complement to Google's cache for finding machine-readable copies of missing Web pages. We were able to find all ten of our test pages in the IA.

Since the IA actually encourages sites to link directly to the archived copies in its database, one could conceivably end a search for a missing Web page there. However, such a choice comes with significant drawbacks. Many IA pages are incomplete, particularly missing image content. Also, of course, IA pages are frozen in time. If the content you're pointing to is still being updated, a live link to the current location is obviously preferable. Thus, linking to the IA should be reserved for material of historical interest that is available nowhere else.
However, finding the documents in the IA has considerable additional value. For example, it drives home how important it is to do timely updates of links that redirect to other sites. Of the ten pages in our test set, three showed evidence in the IA that they at one time provided redirects to the new location. Added to the redirect (302) that went dead between April and December 2002, fully 40% of the now-dead links could have been avoided just by timely maintenance of redirections.

If a copy or printout of the page being linked to isn't otherwise available, the pages found in the IA provide text that can be used to search more effectively for a live copy by other means. Titles and keywords can be plugged into a general search engine. We also found that IA URLs can be plugged directly into lexical signature software, the final technique we used to locate the missing pages.

Lexical Signatures

Lexical signatures are a component of so-called Robust Hyperlinks, a Web link integrity concept developed by Thomas Phelps and Robert Wilensky at UC Berkeley. The notion of robust hyperlinks is quite simple. When linking to a Web page, the linking institution creates a "lexical signature" as part of the URL. Unlike arbitrarily assigned unique IDs, which rely on third party agencies and the voluntary participation of the content's creator, lexical signatures require only that the party linking to a Web page has access to the content.

The lexical signature is generated by determining how frequently the terms used in a particular Web page appear on the Web overall. Terms that occur infrequently, relative to the Internet as a whole, are preferred, especially if those terms appear frequently within the page of interest. Lexical signatures can consist of any number of terms, though five appears to be a good compromise between query effectiveness and search effort.
A lexical signature may not be unique, but it may be distinct enough to help find the Web page again, should it go missing from its original location. Here's an example of a five-term lexical signature for the RLG home page: http://www.rlg.org?lexical-signature=encoded+archival+permissions+eureka+becoming.

The "big idea" behind robust hyperlinks is that with some additional functionality in Web browsers, broken links could, to some degree, be made self-repairing. A link that returned a 404 error would be automatically searched on one or more search engines using the already attached lexical signature.

Beta level open source software for generating lexical signatures is available from UC Berkeley. It requires Java 2 v.1.3, which is available for Windows, Linux, and various flavors of Unix. We tested it on a PC running Windows 98. If you're curious about this software, be forewarned that it isn't quite "click and go" and requires a bit of tinkering to get working. The software is not graphical (it runs on the command line) and documentation is minimal.

In our small study, lexical signatures helped locate five of the ten missing pages, including a live version that didn't turn up by any other means. Of the five failures, one was a consequence of the document no longer being available, and two resulted from crashes of the software (did we mention that it's not quite ready for prime time?). The other two failures are more interesting, and point more clearly to the weaknesses of this approach. In one case, the Web page (a home page containing introductory text) changed considerably and most of the original lexical signature terms no longer appeared. In the last case, the page had moved to another domain, but was largely unchanged. However, one of the five terms chosen by the lexical signature software no longer appeared on the page and that's all it took to doom the effort.
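The term-selection idea described above (prefer terms that are rare on the Web overall but frequent in the page) can be sketched in a few lines of Python. This is not the Berkeley software or its actual algorithm: it is a simple TF-IDF-style stand-in, the function names are ours, and a small list of sample documents plays the role of "the Web overall", which a real tool would estimate from search engine statistics.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase alphabetic terms."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_signature(page_text, corpus_texts, n_terms=5):
    """Pick n_terms that are frequent in the page but rare in the corpus."""
    page_counts = Counter(tokenize(page_text))

    # Document frequency: in how many corpus documents each term appears.
    doc_freq = Counter()
    for text in corpus_texts:
        doc_freq.update(set(tokenize(text)))
    n_docs = len(corpus_texts)

    def score(term):
        # Smoothed inverse document frequency: rarer terms score higher.
        idf = math.log((1 + n_docs) / (1 + doc_freq[term]))
        return page_counts[term] * idf

    # Highest score first; ties broken alphabetically for a stable result.
    ranked = sorted(page_counts, key=lambda t: (-score(t), t))
    return ranked[:n_terms]


def robust_url(url, signature):
    """Append the signature in the query-string style shown in this article."""
    return url + "?lexical-signature=" + "+".join(signature)
```

With this scoring, a term that appears in every sample document (such as "the") scores zero and drops out, which mirrors the preference for distinctive terms. The sketch shares the weakness noted above: if the page later loses one of the chosen terms, a search requiring all of them will miss it.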
In the cases where the lexical signature software crashed, we decided to try making up our own "lexical signatures" by applying some plain common sense about the English language. In both cases we were able to craft five-word searches that brought up the correct new location as the first hit in a Google search. This is why having access to the text of the lost Web page helps: you can assemble search terms based on the title, keywords, or other terms or phrases that will increase the precision of your search. Lexical signatures hold promise because of their potential to automate this otherwise tedious and time-consuming process.

Conclusion/recommendations

Broken links will continue to be a factor in Web site integrity for the foreseeable future. If you link to external sites and value the resources they represent, the burden to maintain the links is on you. Scan external links regularly. Pages that return redirects should be updated as soon as possible. Be aware that not all redirects are easily detected and that monitoring that goes beyond looking at HTTP status codes may be necessary to find some kinds, as well as to ferret out dead links that masquerade as good.

No single technique for rediscovering lost Web content is effective in all cases. Capturing a lexical signature at the time an external resource is linked to, whether one generated by software or handcrafted, can help relocate the resource if it moves without warning. If a signature was not captured ahead of time, cached or archived copies can be used for the same purpose. In a pinch, even simple site browsing can be surprisingly effective in locating lost resources.

[1] Wallace Koehler, "Web Page Change and Persistence—A Four Year Longitudinal Study", Journal of the American Society for Information Science and Technology, v.53, no. 2 (January 15, 2002), pp. 162-171.

--Richard Entlich
Calendar of Events

Sixth International Open Forum on Metadata Registries
January 20-24, 2003
Santa Fe, NM
The Open Forum for 2003 will present standards, tutorials, and practical experience about the following technologies:
• Universal Description, Discovery, and Integration (UDDI)
• OASIS/ebXML Registries
• ISO/IEC 11179 Metadata Registries
• Database Catalogs (e.g., relational DBMS/SQL)
• Software Development Repositories
• Software Component Registries
• Terminology and Ontological Registries, and
• Dublin Core Registries
The theme will be cooperation and interoperation of these technologies, and the conference will highlight the management of metadata and data semantics for Web services, XML data exchanges, data management, and other applications.

Web-Wise Conference on Libraries and Museums in the Digital World

Practical Experiences in Digital Preservation

Announcements

New Publications on Digital Preservation

NINCH Guide to Good Practice

International Standard Being Developed for Archiving PDFs

The Public Record Office (PRO): Making the Nation’s Memory Available Online

RLG News

Joint RLG-JISC Symposium: Selection and Collaboration in Digital Preservation

This joint symposium, sponsored by RLG and the UK Joint Information Systems Committee (JISC) and hosted by the Library of Congress, will address the critical issues of selection and collaboration in preserving digital materials. Digital preservation is an international issue with many transferable lessons. Leading speakers from the USA and Europe will describe their experiences and future plans. Through presentations, discussions and breakout groups, participants will have opportunities to contrast different approaches, consider which approaches will be relevant for their own institution and interests, and further explore opportunities for collaboration in digital preservation across organizational and national boundaries.

This event is the latest in a series of collaborations between JISC and RLG begun in 1996 and resulting in conferences, research projects and publications.

Attendance at the conference itself will be free to staff of RLG member institutions and UK HE/FE institutions, but numbers are limited and early booking is advised. We anticipate a small number of places will also be made available to others to attend on a cost recovery basis (non-members of UK HE/FE institutions or of RLG should register their interest in attending on the conference web page). Accommodation, travel and meals (other than conference lunches and refreshments) will be the responsibility of participants. A number of hotel rooms will be available for booking by conference participants at a discounted rate.

The agenda, further details, and registration are available online at http://www.rlg.org/events/rlgjisc2003-agenda.html
Publishing Information

RLG DigiNews (ISSN 1093-5371) is a newsletter conceived by the members of the Research Libraries Group's PRESERV community. Funded in part by the Council on Library and Information Resources (CLIR) 1998-2000, it is available internationally via the RLG PRESERV Web site. It will be published six times in 2002. Materials contained in RLG DigiNews are subject to copyright and other proprietary rights. Permission is hereby given for the material in RLG DigiNews to be used for research purposes or private study. RLG asks that you observe the following conditions: Please cite the individual author and RLG DigiNews (please cite URL of the article) when using the material; please contact Jennifer Hartzell, RLG Corporate Communications, when citing RLG DigiNews.

Any use other than for research or private study of these materials requires prior written authorization from RLG, Inc. and/or the author of the article.

RLG DigiNews is produced for the Research Libraries Group, Inc. (RLG) by the staff of the Department of Preservation and Conservation, Cornell University Library. Co-Editors, Anne R. Kenney and Nancy Y. McGovern; Production Editors, Martha Crowe and Barbara Berger Eden; Associate Editor, Robin Dale (RLG); Technical Researchers, Richard Entlich and Peter Botticelli; Technical Coordinator, Carla DeMello; Technical Assistant, Kimberly Gazzo.

All links in this issue were confirmed accurate as of December 11, 2002.

Please send your comments and questions to RLG DigiNews Editorial Staff.