Vicky Reich, Stanford University Libraries
David S. H. Rosenthal, Sun Microsystems Laboratories
LOCKSS (Lots Of Copies Keep Stuff Safe) is a prototype of a system to preserve access to scientific journals published on the Web.
Problem
In most respects the Web is a far more effective medium for scientific, technical and medical (STM) communication than paper. Techniques such as datasets in spreadsheets behind graphs, dynamic lists of citing papers, and e-mail notification of citing papers are commonplace. These, plus basic hyperlinks and full text searching, make Web journals easier to access and more useful than paper journals. Material frequently appears earlier online than on paper. Many journals publish material, including peer-reviewed articles, only online. The paper journal no longer serves as the archive of science.
Librarians have confidence in their ability to provide readers with access to material published on paper, even if it is centuries old. Preservation is a by-product of the need to distribute copies to provide access. Paper journal subscriptions provide libraries with an archival copy of the content. Librarians are skeptical about their ability to provide long-term access to materials published on the Web. Subscribing to a Web journal rents access to the publisher’s copy. The publisher may promise "perpetual access", but there is no business model to support the promise. This poses a problem for librarians, who wish to provide both current and future readers with access to published literature.
Librarians want and need to provide their readers with long-term access to selected content. Doing so involves solving three problems:
- The bits themselves must be preserved. All digital storage media have a limited lifetime; the bits need to migrate from one medium to another over time. In practice it is difficult to fund a bulk copying effort when a medium starts decaying. Only the most valuable bits are preserved at each transition.
- Access to the bits must be preserved. Unless links to pages continue to resolve, the material will effectively be lost because it’s likely few will have the patience, knowledge, or resources to retrieve it.
- The ability to translate the bits, once accessed, into human-readable form must be preserved.
There can be no single solution to these problems. A single solution by itself would be perceived as vulnerable. By proposing LOCKSS we are not arguing that other solutions should not be developed and deployed. Diversity is essential to successful preservation.
Requirements
Librarians’ techniques for preserving access to published paper materials have been honed since 415 AD, when much of the world’s then-current literature was lost in the Library of Alexandria fire. These techniques include: acquire lots of copies; distribute them around the world so it is easy to find some of the copies but hard to find all of the copies; lend or replicate your copies when other libraries need access.
In this context, we make a distinction between preserving archives (not LOCKSS) and preserving general circulating collections (LOCKSS). Archives house unique materials that are expensive and/or impossible to replicate widely. Access is restricted to protect the artifacts and to ensure preservation. Circulating library collections provide access, and explicit risks are taken with each artifact to achieve this goal. Copies are loaned to readers on a promise that they will eventually be returned.
Libraries’ circulating collections are a model fault-tolerant distributed system. As a whole, the system is highly replicated and far more reliable than any individual component. There is no single point of failure, no central control to be subverted. There is a low degree of policy coherence between the replicas, and thus low systemic risk. The desired behavior of the system as a whole emerges as the participants take actions in their own local interests and cooperate in ad-hoc, informal ways. Librarians are more likely to have confidence in an electronic system if it works in a familiar way.
The fundamental requirement for LOCKSS is to model the paper system for Web-published materials. If libraries can take physical custody of purchased journals, in a form that preserves reader access, they can assume responsibility for future use. If a library takes custody of a copy of a Web journal, the copy can behave as a Web cache and provide access whether or not it is available from the original publisher. If many libraries do this, the caches can communicate with each other to increase the reliability and availability of the service, just as inter-library loan increases the reliability and availability of access to information on paper.
Another perspective on LOCKSS is that it provides librarians with journal access insurance. The reasons journals might become inaccessible include: subscription cancellations; changes in publisher access policies (for example, when a not-for-profit journal is acquired by a for-profit publisher); a publisher ceasing publication; and incompetent management of a publisher’s web service.
Librarians balance the cost of preserving access to old material against the cost of acquiring new material. They tend to favor acquiring new material. To be effective, subscription insurance must cost much less than the subscription itself. The biggest journal on the HighWire Press web site (http://highwire.stanford.edu/) generates about six gigabytes per year. A cheap PC to hold five years’ worth might cost $600 today, which is about 10% of the subscription for the 5 years. If the running costs of the system can be kept low enough, it should now be practical for many libraries to maintain their own copies. The prospects for this insurance improve as equipment prices fall and subscription prices rise.
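The arithmetic behind this estimate can be made explicit. The short sketch below uses only the figures quoted above; the variable names are ours, and the subscription price is merely what the "about 10%" figure implies, not a quoted price.

```python
# Figures quoted in the text; names are illustrative only.
GB_PER_YEAR = 6          # largest HighWire journal, gigabytes per year
YEARS = 5                # period the cheap PC is meant to cover
PC_COST_USD = 600        # price of the PC
PC_SHARE_PERCENT = 10    # "about 10%" of the five-year subscription

# Storage the PC must hold for the whole period.
storage_needed_gb = GB_PER_YEAR * YEARS

# Subscription cost implied by the 10% figure (an inference, not a quote).
implied_subscription_5yr = PC_COST_USD * 100 / PC_SHARE_PERCENT
implied_subscription_per_year = implied_subscription_5yr / YEARS

# Hardware cost per gigabyte preserved.
hardware_cost_per_gb = PC_COST_USD / storage_needed_gb

print(storage_needed_gb)              # 30
print(implied_subscription_5yr)       # 6000.0
print(implied_subscription_per_year)  # 1200.0
print(hardware_cost_per_gb)           # 20.0
```

Running costs are not included here; they are the open question the paragraph above identifies.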
Open source software development is a crucial project requirement. One goal of LOCKSS is to inspire confidence, and it is hard to have well-founded confidence in a system whose operations are kept secret. The system’s economics mandate free distribution of the software; there is barely a budget for the hardware. The longevity of the system will require many generations of programmers to refine it as problems are encountered.
Technology
The design goal for LOCKSS is to provide librarians with a cheap and easy way of running Web caches which:
- Collect content into the cache as new issues of the journals are published;
- Serve content to readers from either the publisher or from the cache;
- Preserve the contents of the cache for posterity by never flushing it.
The design capitalizes on two features of STM Web journals: peer-reviewed articles are immutable once published, and journal web sites have a logical structure. We are not designing a general-purpose Web content preservation system; LOCKSS is not suitable for volatile content. LOCKSS is a work in progress; the design will evolve as we gain experience.
Collect
A librarian instructs an instance of LOCKSS to preserve a volume of a journal by providing the publisher’s root URL for the volume and a frequency of publication. At that frequency a web-crawler starts from the root URL and fetches all new pages within that sub-tree. The publisher’s web server sees this access as coming from an authorized IP address, so it is allowed. Readers need not access the material to populate the cache. This component of LOCKSS uses off-the-shelf technology - the w3mir crawler.
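The rule "fetch all new pages within that sub-tree" can be illustrated with a small sketch. This is not the w3mir crawler's code; the function names and the example host publisher.example are hypothetical, and the sketch assumes the root URL ends in a slash so that relative links resolve beneath it.

```python
from urllib.parse import urljoin, urlsplit


def in_subtree(root_url: str, candidate: str) -> bool:
    """True if candidate is on the root's host and at or below its path."""
    root = urlsplit(root_url)
    cand = urlsplit(urljoin(root_url, candidate))
    if (cand.scheme, cand.netloc) != (root.scheme, root.netloc):
        return False
    # Treat the root as a directory so that /vol1 does not match /vol10.
    root_dir = root.path if root.path.endswith("/") else root.path + "/"
    return cand.path == root.path or cand.path.startswith(root_dir)


def pages_to_fetch(root_url, discovered_links, already_cached):
    """Return absolute in-subtree URLs that the cache does not hold yet."""
    todo = []
    for link in discovered_links:
        absolute = urljoin(root_url, link)
        if (in_subtree(root_url, absolute)
                and absolute not in already_cached
                and absolute not in todo):
            todo.append(absolute)
    return todo


# Hypothetical volume root and links found on a crawled page.
root = "http://publisher.example/journal/vol1/"
links = [
    "issue1/index.html",                    # inside the volume
    "/journal/vol10/a.html",                # a different volume
    "http://other.example/journal/vol1/x",  # a different host
    "issue1/index.html",                    # duplicate link
]
print(pages_to_fetch(root, links, already_cached=set()))
# ['http://publisher.example/journal/vol1/issue1/index.html']
```

Because published articles are immutable, a page that is already cached never needs to be fetched again, which is why the sketch only filters on whether a URL is already held.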
Serve
The prototype uses the Apache web server to export the contents of each cache to the local network’s users.
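The choice between the publisher's copy and the cached copy can be sketched as a simple fallback. This is not the prototype's Apache configuration; it is a minimal illustration that assumes the publisher is tried first and the cache is used only when the publisher cannot supply the page. The function and parameter names are hypothetical.

```python
from typing import Callable, Dict, Optional


def serve(url: str,
          fetch_from_publisher: Callable[[str], Optional[bytes]],
          cache: Dict[str, bytes]) -> Optional[bytes]:
    """Prefer the publisher's copy; fall back to the preserved copy."""
    try:
        body = fetch_from_publisher(url)
    except OSError:
        # Network failures are treated as "publisher unavailable".
        body = None
    if body is not None:
        return body
    # Returns None if the page was never collected.
    return cache.get(url)


def publisher_down(url: str) -> Optional[bytes]:
    """Stand-in for a publisher whose web service is unreachable."""
    raise OSError("publisher unreachable")


cache = {"http://publisher.example/journal/vol1/a.html": b"cached copy"}
print(serve("http://publisher.example/journal/vol1/a.html", publisher_down, cache))
# b'cached copy'
```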
Preserve
The heart of LOCKSS is a peer-to-peer inter-cache protocol we call LCAP [Library Cache Auditing Protocol]. It runs continually but very slowly between all the caches. LCAP allows the caches to agree on which URLs should exist and what their contents should be. If a cache discovers a missing or damaged URL it can fetch a new copy via HTTP from the original publisher, or from one of the other caches. Care is taken not to subvert the publisher’s access control mechanism; content is delivered only to sites that have rights to it.
The process to detect and repair damage to cached content is:
- As the time since the last integrity check of selected journal content increases, it becomes more and more likely that one of the caches will call for a new check. These checks are similar to opinion polls in which caches vote.
- The caller of the poll challenges other caches to prove they have the same content as the caller using a digital hashing technique. Other caches will respond to the challenge by computing appropriate hash values and replying. The caches hearing the replies will tally the poll. If they are on the winning side their cache is intact. If they are on the losing side, their cache contains some damage. The damaged cache divides the content in question into sections and calls a poll on each section to zero in on the location of the damage. When a damaged file is located a new copy is fetched from the publisher, if the publisher still exists, or from one of the winners.
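The poll described above can be sketched in a few lines. This is not the LCAP implementation, which is described in the paper cited below; it is a simplified illustration in which all votes are visible in one place. The use of SHA-256, the nonce supplied by the caller, and every function name here are assumptions made for the example.

```python
import hashlib
from collections import Counter


def answer_challenge(nonce: bytes, content: bytes) -> str:
    """Hash the caller's nonce together with the local copy of the content.

    Including a fresh nonce (an assumption of this sketch) means a cache
    cannot answer from a stored hash without holding the content itself.
    """
    return hashlib.sha256(nonce + content).hexdigest()


def run_poll(nonce: bytes, copies: dict) -> set:
    """Return the names of the caches that voted with the most common hash.

    Ties go to the hash seen first, a simplification of this sketch.
    """
    votes = {name: answer_challenge(nonce, content)
             for name, content in copies.items()}
    winning_hash, _ = Counter(votes.values()).most_common(1)[0]
    return {name for name, digest in votes.items() if digest == winning_hash}


def locate_damage(nonce: bytes, my_sections: list, peer_sections: dict) -> list:
    """Poll section by section; return indexes where the local cache loses."""
    damaged = []
    for i, mine in enumerate(my_sections):
        copies = {"me": mine}
        for peer, sections in peer_sections.items():
            copies[peer] = sections[i]
        if "me" not in run_poll(nonce, copies):
            damaged.append(i)
    return damaged


# Three caches hold the same three sections; the local copy of section 1
# has been damaged.
good = [b"page one", b"page two", b"page three"]
mine = [b"page one", b"page tw0", b"page three"]
peers = {"cache-a": list(good), "cache-b": list(good)}
print(locate_damage(b"fresh-nonce", mine, peers))  # [1]
```

Once the damaged section is identified, the cache would fetch a replacement from the publisher or from one of the winners, as described above.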
A detailed description of the LCAP protocol is available in a paper presented at the 2000 Usenix Technical Conference, [http://lockss.stanford.edu/freenix2000/freenix2000.html].
LOCKSS is strong in some unusual ways:
- There is no central coordination point that can be attacked.
- It doesn't depend on the Domain Name System, or a Public Key Infrastructure.
- Provided enough other participants preserve the journal articles, a participant can corrupt or lose any or all of its information and still recover. The lost content will be inaccessible to local readers for a while but will eventually reappear.
- There are no passwords or encryption keys to be kept secret.
- The system makes it easier to detect an attacker and limits the rate at which he can damage preserved information.
Testing
An alpha test of LOCKSS has been underway since July, involving about four months of Science Online content and 15 machines at Stanford University, University of California Berkeley, Columbia University, University of Tennessee, Los Alamos National Laboratory, and Harvard University.
The test has established that the system can be installed easily, that the LCAP protocol works over the Internet, that a new cache can collect the specified content and agree with others that it is correct, and that the system can detect and repair deliberate damage.
We plan to assess this test, incorporate the experience and run a beta test at a much more realistic scale beginning late Winter 2001. Goals for the beta test include investigating the system’s performance, making estimates of operational costs to libraries and publishers, and evaluating attempts by a "red team" to subvert the system.
Libraries that have agreed to participate include: Stanford University, University of California Berkeley, Columbia University, University of Tennessee, Los Alamos National Laboratory, Harvard University, British National Library, Carnegie Mellon University, Cornell University, Emory University, Library of Congress, University of Chicago, University of Indiana, University of Leeds, University of Maastricht, University of Melbourne, University of Minnesota, University of Texas Austin, Yale University, Universidad Nacional Autonoma de México, University of Oklahoma HSC.
We’re looking for more libraries outside the US to take part.
Publishers who are supporting the beta test include: American Association for the Advancement of Science, American Physical Society, Federation of American Societies for Experimental Biology, Biophysical Society, Annual Reviews, Rockefeller University Press, American Society for Biochemistry and Molecular Biology, American Association for Clinical Chemistry, National Academy of Sciences, British Medical Journal, American Psychiatric Publishing Inc., Oxford University Press, Company of Biologists Ltd, New England Journal of Medicine, American Society for Clinical Investigation, Radiological Society of North America, Society for General Microbiology, The Endocrine Society, The Histochemical Society, American Thoracic Society, BMJ Publishing Group, American Society of Neuroradiology, Lipid Research Inc., American Society for Investigative Pathology, American Society of Plant Physiologists.
Future Plans
Production System
Plans for a production system will be decided once alpha and beta testing is complete. It will incorporate whatever improvements prove necessary through the beta test, and will be packaged and optimized for minimal effort of installation and tending. Three important performance metrics for LOCKSS once it is deployed in production are: What does it cost a library to run it? How often does the system as a whole lose or corrupt journal articles? What is the probability that a reader will encounter a missing or corrupt article?
It would be very valuable to have multiple independent implementations of the LCAP protocol. All monocultures are vulnerable, and if deployed en masse LOCKSS would be a monoculture. A bug in the implementation could wipe out information system-wide. We hope that by keeping the protocol very simple we will encourage other implementations. The source code will be released under a Stanford equivalent of the University of California at Berkeley license [http://lockss.stanford.edu/softwarelicense.htm].
Beyond Journals
The suitability of LOCKSS for applications other than journals is being explored. In fact, any body (whether or not growing) of relatively immutable and structured documents addressable through HTTP is a candidate for distributed preservation through LOCKSS. Obvious examples may be found among various corpora of government documents in electronic form; LOCKSS could easily serve as the basis for electronic equivalents to "depository library" systems.
Acknowledgments
Heartfelt thanks to Michael Keller, the Stanford University Librarian, for his support and encouragement; to Michael Lesk at the National Science Foundation for funding the project with grant IIS-9907296; to Sun Microsystems Laboratories, which has provided both time and funds; to AAAS and our alpha sites; and to Demian Harvill, Information Systems Project Manager, HighWire Press, Stanford University Libraries.