| February 15, 2003, Volume 7, Number 1 | ISSN 1093-5371 |
Risk Management for Web Resources: A Case Study on Southeast Asian Web Sites

Peter Botticelli

In recent years libraries and other cultural institutions have become increasingly concerned about the tendency for Web sites to lose content over time, especially those that are managed informally and without strong institutional backing. Cornell's Project Prism has been exploring ways to detect risks to Web resources as the first step toward developing a toolset for managing risks without necessarily requiring libraries to capture and archive the Web resources themselves. Thus, over the past year we have been monitoring Web sites and documenting changes in their status that may indicate short- and long-term risks to content.

In early 2002 Allen Riedy, the curator of Cornell Library’s Echols Collection on Southeast Asia, offered the Project Prism team a sample list of fifty-four Web sites of political and nonprofit organizations covering Southeast Asia (hereafter referred to as the "Asia sites") that he considered valuable for long-term preservation as a natural extension of the library’s world-class Southeast Asia holdings. The sites were chosen because they were rich in timely, original content on political issues that was subject to major changes as events unfolded in the region. All the sites chosen by Riedy had been cataloged and made available through the library’s catalog, as in the example below, for http://www.orchestraburma.org:
Many of the Asia sites represent political parties or advocacy groups for such causes as human rights, government reform, independence for indigenous peoples, and environmental protection. Among the sites dedicated to Myanmar (Burma), for instance, is a site, http://www.dassk.com, representing Daw Aung San Suu Kyi, the prominent dissident leader. Riedy noted that the selection and cataloging of Web resources like the Asia sites required a significant investment of time by library staff and that methods and policies are urgently needed to ensure their long-term viability. Thus we have spent the past year studying risk management issues for the Asia sites as part of a larger effort involving test sets of other Web sites.

The content of the Asia sites represents seven different countries: Cambodia, East Timor, Indonesia, Laos, Malaysia, Myanmar, and the Philippines. At the outset we hoped to track the physical location of the servers used for each site. But we discovered that the available information on domain name owners and domain servers was fragmentary at best. We did find anecdotal evidence that a significant number of the Asia sites may actually be managed or published outside Southeast Asia, for instance in the U.S., Europe, Japan, and Australia. Almost half the domain names in this study were originally registered in the U.S., and, through queries to the various "whois" databases, we were able to link roughly a third of the domain servers with American or other Western ISPs. For risk management purposes, it would be a great advantage to have a complete, up-to-date registry for sites intended for long-term preservation.[1]

In monitoring the Asia sites, we were given access to Mercator, a powerful Web crawler developed by researchers at Compaq (now Hewlett-Packard). In crawling sites, we used a strict “politeness” algorithm designed to avoid overloading servers with requests for pages.
We also did not attempt to crawl any pages covered by robots.txt directives, which are commonly used to exclude crawlers. In addition, our study was designed for “passive” monitoring only, using data that was freely available on the Web, and without making any contact with the owners or system operators for the Asia sites.

All Web crawlers are programmed to search the Web and download pages according to predetermined criteria. Thus every crawl begins with a "seed URL" as a starting point and a set of URL filters designed to limit the crawl to a desired site or domain within the Web. The following example illustrates how we set these criteria for crawling the Asia sites.
In this case the seed URL is the home page for this site. The filter has two parts, limiting the crawl to pages in the "bigpond.com.kh" host domain (a Cambodian ISP), and specifically in the subdirectory labeled "ngoforum," the particular site we hoped to monitor. Changes in a URL, especially in its host domain, were an ongoing problem in our monitoring efforts. For instance, between November and December 2002, one site (http://www.easttimorpress.qut.edu.au) moved its base of operations from Australia to East Timor, and hence the site was renamed http://www.easttimorpress.com. The site began as a community service project at Queensland University of Technology in Australia. After East Timor became an independent country in May 2002, students and staff went to Dili to train a group of Timorese journalists to run the site themselves. Our last two crawls of the old site, indicated below, yielded a page linking to the new site, the URL for which we were able to document automatically. This was important given the fact that by January 20, 2003, the old site no longer functioned. Thus, through regular crawling we were able to document the changing provenance of this site, as well as to reveal the possible risk of content loss as the site’s organizational structure changed due to evolving political circumstances.
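The two-part filter described above for the "ngoforum" site (a host domain plus a subdirectory) can be sketched roughly as follows. This is a minimal illustration in Python, not Mercator's actual filter syntax, which is not shown here; the function name `in_scope` and the sample URLs are hypothetical and were made up for the example.

```python
from urllib.parse import urlsplit


def in_scope(url, host_suffix, path_segment):
    """Return True if url is on host_suffix (or a subdomain of it)
    and its path contains path_segment as a directory or file name."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Part 1 of the filter: restrict the crawl to the host domain.
    on_host = host == host_suffix or host.endswith("." + host_suffix)
    # Part 2 of the filter: restrict the crawl to one subdirectory.
    segments = [s for s in parts.path.split("/") if s]
    return on_host and path_segment in segments


# Hypothetical URLs, for illustration only.
candidates = [
    "http://www.bigpond.com.kh/users/ngoforum/index.htm",
    "http://www.bigpond.com.kh/users/someone-else/",
    "http://www.example.org/ngoforum/",
]
kept = [u for u in candidates if in_scope(u, "bigpond.com.kh", "ngoforum")]
print(kept)
```

A filter of this kind also shows why a change of host domain, as in the East Timor case, takes a site out of scope until the filter is updated by hand.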
*Both the old and new URLs were crawled on 1/14/03, at which time http://www.easttimorpress.qut.edu.au still registered a single 200. As a test, we recrawled the old URL a week later and discovered that it was now a 404.

Using Mercator, we crawled each of the Asia sites ten times in roughly an eight-month period, between late April 2002 and January 2003. We were able to successfully crawl all of the sites at least once, with one exception: http://www.freemalaysia.com, which was apparently shut down sometime between late January 2002 (the last available entry we were able to find in the Internet Archive) and the time of our first crawl, in April 2002.[2]

Once Mercator has crawled a site, it automatically generates a series of reports derived from the HTTP and HTML data that a server returns any time a Web page is requested by a client. We programmed Mercator to capture the HTTP status code for each discovered page, the full set of HTTP headers, and the full text of the HTML source for each page. We were particularly interested in HTTP headers as potential sources of information for risk management, given that they were designed in part to ensure that cached Web pages are complete, authentic, and up-to-date. We have also begun to examine HTML META (metadata) elements as possible sources of information for risk management. Besides our interest in META tags, a colleague, Hye Yeon Hann, has carried out a pilot study in which she documented the incidence of HTML tags used for dynamic features such as applets, scripts, and interactive forms. In the near future we plan to automate our data gathering on the full spectrum of HTML elements and thus to compare the reliability of pages with dynamic or multimedia features versus all other pages.

The most basic data we've been able to track is the number of Web pages for each site. None of the Asia sites is very large by Web standards, as the chart below indicates.
Half of the sites have between 100 and 1,000 pages, while only a quarter have more than 1,000. Taken as a whole, the Asia sites consist of just over 70,000 total pages. By contrast, we discovered about 1.4 million pages (averaging over 12,000 pages per site) while crawling a list of library Web sites for members of the Association of Research Libraries.
Despite the relatively small size of the Asia sites, preserving 70,000 Web pages across fifty-four different sites is clearly a nontrivial matter, making it necessary to monitor these sites closely and to have a robust and at least partially automated set of tools, including a powerful crawler, to detect risks and, as much as possible, to rescue content before permanent losses occur. Our work with Mercator is part of a larger effort to identify and classify risks to Web resources and ultimately to enforce preservation policies developed for specific collections and types of resources.[3] At the outset of the project we were naturally concerned that a collection like the Asia sites might have a very high rate of content loss, raising the potential cost of implementing effective risk management for these types of resources. In the course of crawling, which began at the end of April 2002, we were able to record the disappearance of four sites (besides http://www.freemalaysia.com, as noted above). Two sites, http://www.orchestraburma.org (recall the catalog entry for this site, above) and http://www.partikeadilan.org, were lost by our second crawl on June 6, 2002 (our first crawl was on April 30, 2002), and months later we documented the fact that these domains had been acquired by an American ISP and a pornography site, respectively. A third site, http://www.barisanalternatif.org, was lost by our third crawl, in July 2002. As of January 2003 this domain was available for sale. A fourth site, http://www.laskarjihad.or.id, became defunct by our November 2002 crawl. This site represented the militant Islamic group Laskar Jihad, which disbanded immediately before the October 2002 bombing of a nightclub in Bali.[4] The examples above highlight the complexities of preserving information in the dynamic and transitory environment of the Web. 
They also highlight the value of automated data-gathering methods, including crawling, supplemented by other, more qualitative sources, in discovering risks to the integrity of Web-based materials.

In spite of the handful of catastrophic losses we encountered, our overall results indicate a significant but relatively low failure rate for Web page downloads, as indicated by HTTP status codes. On average, across ten crawls we found that 92% of all pages discovered returned a 200 ("OK").[5] Of the remainder, 7% were reported as 404 ("Not Found") errors, and the other 1% were a combination of server errors (500s), socket-level errors (failure to connect to server), access-restricted pages (401, 403), robot-excluded pages (blocked from crawling), and redirected pages (300s).

We also found that a significant percentage of the Asia sites had relatively low rates of missing pages.[6] In every crawl at least one-third of the sites were missing less than 1% of their pages. In seven crawls at least one-fifth of the sites had no 404s whatsoever. In our January 2003 crawl we found that twelve sites had at least 99% of their pages return a 200 ("OK") code and no more than 1% return 400-range (client error) codes. While more work is needed to test the completeness of pages having a 200 code, our data thus far suggests that a significant percentage of the Asia sites are generally reliable in responding to requests for Web pages. Also, we found that the sites that were registered in Southeast Asia had only a slightly higher rate of missing pages than sites registered in the U.S. or other Western countries, as the table below indicates.
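The status-code breakdown reported above (200s, 404s, and the residual mix of other outcomes) amounts to a tally by category. The Python sketch below is a hypothetical illustration of such a tally, not Mercator's reporting code; the function names, the category labels, and the synthetic sample are all assumptions. Robot-excluded pages are left out because they reflect a crawler decision rather than an HTTP status, and `None` stands in for a socket-level failure.

```python
from collections import Counter


def classify(status):
    """Map an HTTP status code (or None for a failed connection)
    to one of the coarse categories used in the discussion above."""
    if status is None:
        return "socket error"
    if status == 200:
        return "ok"
    if status == 404:
        return "not found"
    if status in (401, 403):
        return "access restricted"
    if 300 <= status < 400:
        return "redirect"
    if 500 <= status < 600:
        return "server error"
    return "other"


def summarize(statuses):
    """Return the percentage of pages in each category, rounded to 0.1."""
    counts = Counter(classify(s) for s in statuses)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cat: round(100 * n / total, 1) for cat, n in counts.items()}


# Synthetic sample shaped like the averages quoted above; not real crawl data.
sample = [200] * 92 + [404] * 7 + [500]
print(summarize(sample))
```

As footnote 5 cautions, a tally like this treats every 200 as a success, so error pages served with a 200 code would need a separate check of the page content.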
However, we did find many sites exhibiting danger signs for content loss. On average, across ten crawls we found that about one-fifth of the sites had at least 10% of their pages missing and that 10% of the sites were missing at least 20% of their pages. In our latest crawl we found five sites missing 25-40% of their total pages. These potential losses should be viewed in absolute as well as relative terms. Thus, one site was recently missing 1,005 out of a total of 9,206 pages, or 11%. Another site was missing 690 out of 1,740 pages, or 40%. For large Web sites, even a small percentage of missing pages could mean thousands of pages lost. We've also found it interesting to track changes in the number of pages for each site discovered by Mercator over successive crawls. As we gather more data over time, we are correlating these numbers with other potential indicators of risk. For instance, in January 2003 we found that one site had shrunk 91% from our previous crawl in December. Three other sites showed declines between 11% and 38%. Taking these four sites together, our crawl results show a potential loss of about 2,000 pages out of nearly 7,500 total pages. Besides decreases in pages discovered for sites, we've also documented substantial increases in the number of pages discovered from crawl to crawl. In January 2003 we found two sites whose page totals grew by 34% (adding 1,000 pages) and 56% (adding 53 pages) from the previous month. Although an increase in pages does not by itself indicate a risk factor for a site, it could be a sign of organizational changes that might put some older content at risk. As we gather more data, we plan to investigate possible correlations between such changes in the composition of a site and possible risks of content loss. Changes in the number of pages discovered for a site can indicate major organizational risks. 
For instance, after East Timor received its independence in May 2002, the East Timor government site, http://www.gov.east-timor.org, showed a drastic decline in pages discovered, as shown in Table 3. Closer examination revealed that the site was under construction and its previous content was off-line. As the site continued to exist, it was not clear that the old content had been discarded, but it clearly was in danger of being lost to the user. A similar phenomenon occurred after the 2000 U.S. presidential election. In late 2000, George W. Bush's transition team instructed all federal agencies to remove information from their Web sites specifically related to the Clinton administration. Cases like this show the value of regular monitoring of valuable Web sites, as early detection of organizational changes may leave enough time to negotiate archiving agreements with the owners of a still-functioning site.

We are also working on methods for capturing the full content of Web pages (including dynamic features and images automatically linked to pages) as part of our crawling routine, making it possible to preserve a complete "snapshot" of a site at a point in time. That way, if a site disappears, we would be able to archive the last available version of the site. Moreover, as we discover particular risks to sites (e.g., a major change in a country’s political climate), we could step up crawling for affected sites to increase the likelihood that we could archive sites before they disappeared.
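The two indicators discussed above, the share of missing pages and the change in pages discovered between crawls, reduce to simple arithmetic. The Python sketch below works through the figures quoted earlier (1,005 of 9,206 pages and 690 of 1,740 pages missing, and a site that shrank 91%). The function names and the 10% thresholds are assumptions chosen for illustration; the study does not state fixed cut-offs for flagging a site.

```python
def missing_rate(missing, total):
    """Percentage of discovered pages that could not be retrieved."""
    return 100 * missing / total


def percent_change(previous, current):
    """Percentage change in pages discovered between two crawls."""
    return 100 * (current - previous) / previous


def flag_site(missing, total, previous_total,
              missing_threshold=10.0, shrink_threshold=10.0):
    """Return the risk flags raised for a site.
    Both thresholds are illustrative assumptions, not values from the study."""
    flags = []
    if missing_rate(missing, total) >= missing_threshold:
        flags.append("high missing rate")
    if percent_change(previous_total, total) <= -shrink_threshold:
        flags.append("site shrank")
    return flags


print(round(missing_rate(1005, 9206)))   # 11
print(round(missing_rate(690, 1740)))    # 40
# Hypothetical page counts that reproduce a 91% decline.
print(percent_change(1000, 90))          # -91.0
```

A flag of either kind would be a prompt for closer inspection or more frequent crawling, not proof that content has been lost.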
In the case of the East Timor government site, after examining the results of our second crawl (6/6/02) we immediately decided to investigate the site to determine what had changed. Since both pages discovered by the crawl were 200s ("OK"), we had to look at the HTML source to find evidence of what had happened; namely, that the site was under construction. In successive crawls we were not able to explain the changes we detected between 8/12/02 and 10/3/02, except that the site was apparently still under construction. By December 2002 the site's home page indicated that the site was again fully functional, although it had obviously undergone substantial changes from April of that year. While we could not say that the site's old content had been lost, especially since we were monitoring the site passively, there was clearly a substantial risk of content loss owing to the magnitude of change in the number of pages on the site.

Besides tracking the number of pages and their HTTP status codes, we programmed Mercator to capture and report the HTTP headers that servers automatically send in response to requests for Web pages. We were interested in determining which headers were used and how frequently, because data of this type may indicate how well managed Web resources are, as well as provide information that may prove valuable in risk management efforts over time. Taking the Asia sites as a whole, we discovered a total of fourteen headers in use across the sites, as indicated below. The table gives the number of instances of each header relative to the total number of pages detected by Mercator for this phase of the study.[7]
The results above closely matched those of the other collections we tested. Of the fourteen header types we found in use by the Asia sites, we believe that the five listed below may be particularly valuable for risk management, although more study is needed to assess the consistency and reliability of information typically provided in these fields.
The Content-Type header is particularly important because it is used to record the MIME type, or "media type," as it is sometimes described, for a Web document. In our December crawl of the Asia sites we found documents representing thirty-two different media types, as listed below.
Even a quick glance at the above list reveals many potentially at-risk media types. But then, we should keep in mind that 97.8% of all pages in this sample consist of just four very common document types: HTML, GIF images, JPEG images, and PDF files. However, if we are concerned with preserving audio and video files, for instance, we have to be concerned with seven different formats, in spite of the fact that they make up less than 1% of the total number of pages in this sample. Given the potential cost of preserving out-of-date media types, it is important to monitor the use of at-risk formats and whenever possible to encourage Web site creators to consider migrating documents to common or standard formats. Our data shows the complexity of Web resources and the need for institutions to choose their priorities carefully in deciding what content needs to be preserved and what risks are most pressing at any given time.

Our goal in tracking the Asia sites has been to identify and monitor preservation risks and to provide comparative data on the organizational and technical integrity of Web sites. Thus far we have been able to use the full spectrum of HTTP data provided by Web servers, and at present we are refining a set of tools to analyze the HTML source for each page, which will enable us to track links and to identify dynamic elements embedded in pages that may add to the risk of content loss over time. In general, our results from the Asia sites highlight the need for a focused approach to preserving distributed information resources on the Web by gathering risk data about sites on a regular basis, classifying and comparing risks as they are discovered, and developing a robust set of tools to rescue content before it is permanently lost.
Ultimately, Project Prism envisions a comprehensive system for risk management and preservation of Web resources, giving libraries the power to maintain lasting virtual collections out of diverse resources they may not own or control directly. We believe that as patrons come to depend more on all types of online information, libraries can add substantial value to Web resources by identifying and actively managing risks of content loss on the Web.

Acknowledgements

Footnotes

[2] This site had been under political pressure from the Malaysian government, as it published many reports critical of senior officials. See http://news.bbc.co.uk/2/hi/asia-pacific/country_profiles/1304569.stm and http://www.thestar.com.my/news/storyx1000.asp?file=/1999/8/9/nation/0908eede&sec=.

[3] See "Preservation Risk Management for Web Resources: Virtual Remote Control in Cornell's Project Prism," D-Lib Magazine (9) 1 (January 2002).

[4] See http://news.bbc.co.uk/2/hi/asia-pacific/770263.stm.

[5] We should point out that 200 codes can effectively mask content losses in sites that are programmed to generate a page containing an error message indicating that content is missing. For a detailed discussion of the problem and its possible solutions, see http://www.rlg.org/preserv/diginews/v6_n6_faq.html.

[6] By "missing pages" I'm actually referring to URLs that result in error codes in the 400 ("client error") and 500 ("server error") ranges, as well as pages that result in socket-level errors (unable to connect to server). Strictly speaking, the presence of a 404 ("Page Not Found") error, for example, does not necessarily mean that a page is lost, as it may still exist under a different URL. But from the users' point of view, a bad URL with no redirect or other information provided is functionally lost content. Hence, for the purposes of this study we decided to label pages "missing" if we were unable to locate them by their URL, which most users still depend on to identify and distinguish Web pages. We are currently investigating alternative methods for identifying pages without relying on URLs, though from an archival point of view, more work is needed to show if we can determine the provenance of Web pages in the absence of clear information provided by a page's creator.

[7] The number of pages in this sample is less than the total pages we discovered for the Asia sites because Mercator was only able to report HTTP headers for pages having the text/html MIME type.
Publishing Information

RLG DigiNews (ISSN 1093-5371) is a newsletter conceived by the members of the Research Libraries Group's PRESERV community. Funded in part by the Council on Library and Information Resources (CLIR) 1998-2000, it is available internationally via the RLG PRESERV Web site. It will be published six times in 2003. Materials contained in RLG DigiNews are subject to copyright and other proprietary rights. Permission is hereby given for the material in RLG DigiNews to be used for research purposes or private study. RLG asks that you observe the following conditions: Please cite the individual author and RLG DigiNews (please cite URL of the article) when using the material; please contact Jennifer Hartzell, RLG Corporate Communications, when citing RLG DigiNews. Any use other than for research or private study of these materials requires prior written authorization from RLG, Inc. and/or the author of the article.

RLG DigiNews is produced for the Research Libraries Group, Inc. (RLG) by the staff of the Department of Research, Cornell University Library. Co-Editors, Anne R. Kenney and Nancy Y. McGovern; Production Editor, Martha Crowe; Associate Editor, Robin Dale (RLG); Technical Researchers, Richard Entlich and Peter Botticelli; Technical Coordinator, Carla DeMello; Technical Assistant, Valerie Jacoski.

All links in this issue were confirmed accurate as of February 15, 2003.

Please send your comments and questions to RLG DigiNews Editorial Staff.