December 15, 2002, Volume 6, Number 6
ISSN 1093-5371
Can you recommend any techniques for reducing the incidence of broken external links on my Web site, and for rediscovering resources that have moved?

There is conventional wisdom for both parts of this question. In our response we examine some of the subtleties involved in broken link detection and then demonstrate how well some of the oft-recommended strategies for locating lost Web content really work. Web site integrity and preservation are increasing concerns in the library and archival communities, and the subject of several research and development efforts. The answer to this question is informed by work on Cornell's Project Prism, an NSF-funded DLI2 initiative, including investigations of integrity measures for external resources, the testing of tools for Web site monitoring, and the development of techniques for locating lost Web pages by utilizing lexical signatures and document similarity software.

Link rot, in which a significant percentage of links no longer work, is a common problem on the Web. It has also been noted that on any site where links are not maintained or updated on a regular basis, the percentage of outdated links rises rapidly over time. For example, in a link analysis of RLG DigiNews conducted for the fifth anniversary issue, the percentage of bad links rose from approximately 10% after one year to nearly 40% after five years.

The causes of link rot are well known. As sites grow and change, content is often reorganized or moved. In addition, domain names come and go, sometimes disabling all the existing links to a site simultaneously. A variety of techniques, including HTTP redirects, HTML redirects ("meta refreshes"), and scripting languages such as JavaScript and PHP can be used to transport users relying on obsolete URLs to a new location, with or without prior warning that the content has moved.

Other techniques are available to help stabilize Web content.
Persistent addressing schemes such as PURL servers, DOIs (Digital Object Identifiers) and URNs (Uniform Resource Names) allow content to move around on local servers without affecting the functionality of existing links.

Unfortunately, none of these techniques provides a complete or permanent solution to the link rot problem. Redirects are fairly widely used, but are generally kept active for limited periods of time. Once the redirect is removed, all links that haven't been updated will fail. None of the permanent addressing schemes is widely used yet. In either case, the external links on a site are at the mercy of those who maintain the external sites.

Grander solutions have been contemplated. A British startup known as LinkGuard attempted to solve the link rot problem by creating and regularly updating a massive (40 Terabyte) map of all known Web links. The idea was to sell a service that would reference the link map to correct broken links as they were encountered. Unfortunately, the appetite of Web site maintainers to pay for the elimination of broken links proved less voracious than anticipated, and LinkGuard shut down in November 2001, leaving as one of its legacies several hundred bad links to its now defunct linkguard.com domain name.

Until another "big idea" comes along to solve the link rot problem once and for all, more modest techniques must be relied upon. The process requires two basic steps. First one must be able to detect when a link has gone bad, or even better, when it's about to go bad. Second, once a link is confirmed bad, one needs a mechanism to find a viable link with which to replace it. Ideally, a software tool would automate both steps and require minimal human intervention. Realistically, available tools for automated handling of the second step are still in their infancy and no one technique is suitable for all situations.
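To make the first of these two steps a little more concrete, here is a rough sketch in Python of how a link report might sort results into categories. It is only an illustration, not a description of any particular link checker. The function name `classify_link`, the category labels, and the keyword list are all assumptions made for this example. The function takes a status code that has already been fetched (or `None` if the site could not be reached) plus the page text, and flags pages that return 200 but look as though they have moved, a case discussed in the next section.

```python
import re

# Wording that may signal a moved page. This keyword list is an assumption
# chosen for illustration; a real monitoring setup would tune it.
MOVED_HINTS = ("moved", "forwarded", "new location")

# Matches an HTML "meta refresh" tag, one common way of redirecting
# without using an HTTP redirect status code.
META_REFRESH = re.compile(r'''<meta[^>]+http-equiv=["']?refresh''', re.IGNORECASE)


def classify_link(status, body=""):
    """Sort one link-check result into a rough category.

    status: HTTP status code as an int, or None if no response was received.
    body:   text of the returned page, if any.
    """
    if status is None:
        # No status code at all: outage, or a lost site or domain.
        return "unreachable"
    if status in (301, 302):
        # Endangered link: update it while the redirect still works.
        return "redirect"
    if status in (404, 410):
        return "dead"
    if status == 200:
        # A 200 is not proof of a good page: look for hidden redirects
        # and for wording that suggests the content has gone elsewhere.
        if META_REFRESH.search(body) or any(h in body.lower() for h in MOVED_HINTS):
            return "suspect"
        return "ok"
    return "other"
```

Links labeled "suspect" would still need a human look, since a page can legitimately contain a word such as "moved" without having gone anywhere.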
Step one: Identifying failed and endangered links

Bad external links can be detected using one of the many available link checker software applications or Web services. Typically the product is pointed at a Web page on your site and directed to check all external links. The resulting report will indicate the status of each link, including 404 errors (page not found), or those that required redirection, e.g., HTTP status codes 301 (moved permanently) and 302 (moved temporarily). Sites that can't be reached at all won't return a status code, and may indicate either a temporary service outage or a more permanent loss of an entire site or domain.

However, there are typically more failed or endangered links than a simple link check reveals. Some custom 404 messages improperly report an HTTP status code of 200 (indicating a good page), a phenomenon dubbed "phantom URLs" by Wallace Koehler. [1] Also, redirects that are mediated by meta refreshes or scripts may not be detected by some link checkers. Therefore, a Web monitoring package that can also detect changes in page size or report the presence of particular keywords (such as "moved," "forwarded," "new location," etc.) is desirable.

Step two: Safeguarding endangered links and replacing obsolete ones

Consider any redirect returned by an external link on your site to be an opportunity to avoid a future broken link. Most redirects exist on a temporary basis. As soon as you become aware that an external link is being redirected, substitute the new destination URL on your site.

That's pretty straightforward. But what if an object pointed to by an external link suddenly turns up missing, with no warning? A few years ago, the advice for finding the new location (if one indeed existed) would have included the following:

1. "trim" directories off the end of the URL to get a directory listing
2. use the site's own search feature
3. search for part of the URL or the file name in a general search engine

More recently, some additional tricks might be recommended, as well:

4. see if the document is stored in Google's cache
5. look for an archived copy in the Internet Archive's Wayback Machine
6. search for the page using a lexical signature

How practical is it to use these techniques? How well do they work in reality? We conducted a small, informal study to examine the efficacy of each of these six techniques.

Testing techniques for finding missing links

For source URLs, we went back to the April 2002 link tests on RLG DigiNews and chose an issue about two years old that had, at the time, eight dead links (404) and two endangered links (302), for a total of ten links, on six unique sites. Nine of the links were to HTML pages, and one to a PDF (Portable Document Format) document.

The results are shown in Table 1. Each link is given a numerical ID# in column 1, with letters appended for those sites that had more than one bad link. Column 2 shows the HTTP status code reported in April 2002, while column 3 shows the HTTP status code in December 2002. All of the eight that were dead back in April remained dead eight months later. Of the two that used redirects, one now comes back as dead, while the other is still a redirect, but the site it redirects to is now dead. In effect, the two redirects have both become dead.

As best we were able to determine, nine of the ten pages are still available on live Web sites (column 4) and of those nine, seven can be found in their original domain (column 5). Two could only be located in other domains.

We tried to find the original Web page or document by each of the six methods mentioned above (columns 6 through 11). The "success rate" percentages shown in the last row are based on the assumption that each method is evaluated independently. Therefore, even though, for example, it's impossible for a site search to succeed if a site has no internal search engine, we only record that the method failed to find that particular Web page or document. Below the table, we present a detailed analysis of each technique.
Some of the techniques described are only viable if the text of the Web page is available. Obviously having a copy in any form (even a printout) makes it easier to recognize that you've found the page referenced in the original link. But some techniques require that a machine-readable copy be available. Even if that copy is part of an archive or cache that you wouldn't want to link to, it may make it possible to use that text to find another copy that is usable as a linking source.
Table 1

Trimming the tree

This time-honored technique involves chopping off successively higher branches of the directory "tree" that forms the URL. For example, consider a page located at http://www.xyz.com/level1/level2/level3/file.html. The "trimming" technique would start by removing just the file name to see if a directory of level3 appears. If the desired page or a possible path to it doesn't appear, then chop off level3 and try level2, etc.

For our test set, trimming allowed us to rediscover seven of the ten original documents (the other three were no longer available on the original site). However, the process often did not go as originally envisioned. That's because most Web sites no longer allow file directories to be viewed, for security reasons. Thus, in most cases we had to return all the way to the home page and then browse the site, using knowledge of where the document had been previously stored (e.g., on a "publications" page) or which department within the organization originally produced it. Success with this technique depends on both the logical organization of the site and the experience of the searcher. However, it's a low-tech technique that requires only a Web browser and some patience.

Site Search

Since most documents that go missing have merely moved elsewhere on a site, one would think that site search would be a very effective way to relocate lost documents. In our study, two of the ten documents were on a site lacking site search capability, but even so, site search claimed only a 50% success rate. Three pages didn't turn up because of misconfigured site search engines.

Overall, our experience with this technique has been spotty. Besides misconfigurations, we've seen site search features with outdated indexes and poor quality search engines. Also, many sites these days perform site search by sending the search parameters to Google, but restricting the results by domain.
Obviously such results will suffer whatever deficiencies Google has with respect to that site, including possibly incomplete or outdated indexing.

URL/file name Search

This technique involves submitting a portion of the URL (such as a directory name or file name) to a general search engine. It is most likely to succeed for fairly distinctive names. We were able to rediscover half of the missing pages this way, using Google as the search engine (though to maximize the odds of finding a lost page, multiple search engines should be tried). For each of the successful searches, the target page was returned as the first hit, even though in some cases, the number of hits was very large (over 200,000 in one case). The other cases failed because in moving to new locations, the file names were changed.

Obviously, it is possible to search for a lost page using identifiers other than the directory name or file name. If a copy of the document is available, the title, keywords, and other terms can be used. We will discuss this technique further under lexical signatures, below.

Google's cache

In recent years, the Google search engine's cache of Web pages has become popular as a means to find copies of missing items, particularly material that has been taken down because it is controversial, embarrassing, or incriminating. The Google cache is most often used following a standard Google search, when the main link to a returned page is inaccessible. However, it is possible to search Google's cache directly. All that is required is to plug the desired URL into a specially formatted Google address. For example, to find the cached version of the October 2002 issue of RLG DigiNews, one would go to http://www.google.com/search?q=cache:www.rlg.org/preserv/diginews/diginews6-5.html.

In our study, the Google cache came up empty. None of our missing pages could be found there. This may well be due to the fact that most of the pages we looked for have been gone for at least 8 months.
It is not clear how long a typical Google cache file is kept once the original page it represents is no longer available, but Google's cache may be primarily of value for pages that have gone missing fairly recently. Therefore, it might prove most valuable in the maintenance of sites that are being monitored regularly and where the attempt to find the replacement page occurs shortly after the original disappears.

Internet Archive Wayback Machine

The Wayback Machine became generally available in October 2001 and currently includes archived Web pages starting in 1996 running through the beginning of 2002. Though huge, the Internet Archive (IA), on which the Wayback Machine operates, is not comprehensive (see http://www.rlg.org/preserv/diginews/diginews6-3.html#interview for details on various aspects of the IA). In particular, the IA is always at least 6 months behind in providing access to archived material. Thus, it makes a good complement to Google's cache for finding machine-readable copies of missing Web pages. We were able to find all ten of our test pages in the IA.

Since the IA actually encourages sites to link directly to the archived copies in its database, one could conceivably end a search for a missing Web page there. However, such a choice comes with significant drawbacks. Many IA pages are incomplete, particularly missing image content. Also, of course, IA pages are frozen in time. If the content you're pointing to is still being updated, a live link to the current location is obviously preferable. Thus, linking to the IA should be reserved for material of historical interest that is available nowhere else.

However, finding the documents in the IA has considerable additional value. For example, it drives home how important it is to do timely updates of links that redirect to other sites. Of the ten pages in our test set, three showed evidence in the IA that they at one time provided redirects to the new location.
Added to the redirect (302) that went dead between April and December 2002, fully 40% of the now dead links could have been avoided just by timely maintenance of redirections.

If a copy or printout of the page being linked to isn't otherwise available, the pages found in the IA provide text that can be used to more effectively search for a live copy by other means. Titles and keywords can be plugged into a general search engine. We also found that IA URLs can be plugged directly into lexical signature software, the final technique we used to locate the missing pages.

Lexical Signatures

Lexical signatures are a component of so-called Robust Hyperlinks, a Web link integrity concept developed by Thomas Phelps and Robert Wilensky at UC Berkeley. The notion of robust hyperlinks is quite simple. When linking to a Web page, the linking institution creates a "lexical signature" as part of the URL. Unlike arbitrarily assigned unique IDs, which rely on third party agencies and the voluntary participation of the content's creator, lexical signatures require only that the party linking to a Web page has access to the content.

The lexical signature is generated by determining how frequently the terms used in a particular Web page appear on the Web overall. Terms that occur infrequently, relative to the Internet as a whole, are preferred, especially if those terms appear frequently within the page of interest. Lexical signatures can consist of any number of terms, though five appears to be a good compromise between query effectiveness and search effort. A lexical signature may not be unique, but it may be distinct enough to help find the Web page again, should it go missing from its original location. Here's an example of a five-term lexical signature for the October 2002 issue of RLG DigiNews: http://www.rlg.org/preserv/diginews/diginews6-5.html?lexical-signature=societal+modeled+formulate+semantic+populate.
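The term selection just described can be sketched in a few lines of Python. This is not the UC Berkeley software, and the scoring shown (term count on the page multiplied by the log of inverse document frequency, a common TF-IDF style weighting) is our assumption about one reasonable way to prefer terms that are rare on the Web but frequent on the page. The function names and the `doc_freq` table, which stands in for Web-wide term counts that would in practice have to come from a search engine or a large corpus, are hypothetical.

```python
import math
import re
from collections import Counter


def lexical_signature(text, doc_freq, corpus_size, n_terms=5):
    """Return the n_terms words that best distinguish this page.

    doc_freq maps a term to the number of documents containing it
    (a hypothetical stand-in for Web-wide statistics); corpus_size is
    the total number of documents those counts were drawn from.
    """
    words = re.findall(r"[a-z]+", text.lower())
    term_counts = Counter(words)

    def score(term):
        # Terms missing from the table are treated as maximally rare.
        df = doc_freq.get(term, 1)
        return term_counts[term] * math.log(corpus_size / df)

    # Highest score first; ties are broken alphabetically for stability.
    ranked = sorted(term_counts, key=lambda t: (-score(t), t))
    return ranked[:n_terms]


def robust_url(url, signature):
    """Append a signature to a URL in the form used in the example above."""
    return url + "?lexical-signature=" + "+".join(signature)
```

With a real frequency table, `robust_url(page_url, lexical_signature(page_text, doc_freq, corpus_size))` would yield a link of the same shape as the RLG DigiNews example, though not necessarily with the same five terms.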
The "big idea" behind robust hyperlinks is that with some additional functionality in Web browsers, broken links could, to some degree, be made self-repairing. A link that returned a 404 error would be automatically searched on one or more search engines using the already attached lexical signature.

Beta level open source software for generating lexical signatures is available from UC Berkeley. It requires Java 2 v.1.3, which is available for Windows, Linux, and various flavors of Unix. We tested it on a PC running Windows 98. If you're curious about this software, be forewarned that it isn't quite "click and go" and requires a bit of tinkering to get working. The software is not graphical (it runs on the command line) and documentation is minimal.

In our small study, lexical signatures helped locate five of the ten missing pages, including a live version that didn't turn up by any other means. Of the five failures, one was a consequence of the document no longer being available, and two resulted from crashes of the software (did we mention that it's not quite ready for prime time?). The other two failures are more interesting, and point more clearly to the weaknesses of this approach. In one case, the Web page (a home page containing introductory text) changed considerably and most of the original lexical signature terms no longer appeared. In the last case, the page had moved to another domain, but was largely unchanged. However, one of the five terms chosen by the lexical signature software no longer appeared on the page and that's all it took to doom the effort.

In the cases where the lexical signature software crashed, we decided to try making up our own "lexical signatures" by applying some plain common sense about the English language. In both cases we were able to craft five-word searches that brought up the correct new location as the first hit in a Google search.
This is why having access to the text of the lost Web page helps, because you can probably assemble search terms based on the title, keywords, or other terms or phrases that will increase the precision of your search. Lexical signatures hold promise because of their potential to automate this otherwise tedious and time-consuming process.

Conclusion/recommendations

Broken links will continue to be a factor in Web site integrity for the foreseeable future. If you link to external sites and value the resources they represent, the burden to maintain the links is on you. Scan external links regularly. Pages that return redirects should be updated as soon as possible. Be aware that not all redirects are easily detected and that monitoring that goes beyond looking at HTTP status codes may be necessary to find some kinds, as well as to ferret out dead links that masquerade as good.

No single technique for rediscovering lost Web content is effective in all cases. Capturing a lexical signature at the time an external resource is linked to, whether one generated by software or handcrafted, can help relocate the resource if it moves without warning. If not done ahead of time, cached or archived copies can be used for the same purpose. In a pinch, even simple site browsing can be surprisingly effective in locating lost resources.

[1] Wallace Koehler, "Web Page Change and Persistence—A Four Year Longitudinal Study", Journal of the American Society for Information Science and Technology, v.53, no 2, (January 15, 2002), pp.162-171.

--Richard Entlich
Publishing Information

RLG DigiNews (ISSN 1093-5371) is a newsletter conceived by the members of the Research Libraries Group's PRESERV community. Funded in part by the Council on Library and Information Resources (CLIR) 1998-2000, it is available internationally via the RLG PRESERV Web site. It will be published six times in 2002.

Materials contained in RLG DigiNews are subject to copyright and other proprietary rights. Permission is hereby given for the material in RLG DigiNews to be used for research purposes or private study. RLG asks that you observe the following conditions: Please cite the individual author and RLG DigiNews (please cite URL of the article) when using the material; please contact Jennifer Hartzell, RLG Corporate Communications, when citing RLG DigiNews. Any use other than for research or private study of these materials requires prior written authorization from RLG, Inc. and/or the author of the article.

RLG DigiNews is produced for the Research Libraries Group, Inc. (RLG) by the staff of the Department of Preservation and Conservation, Cornell University Library. Co-Editors, Anne R. Kenney and Nancy Y. McGovern; Production Editors, Martha Crowe and Barbara Berger Eden; Associate Editor, Robin Dale (RLG); Technical Researchers, Richard Entlich and Peter Botticelli; Technical Coordinator, Carla DeMello; Technical Assistant, Kimberly Gazzo.

All links in this issue were confirmed accurate as of December 11, 2002.

Please send your comments and questions to RLG DigiNews Editorial Staff.