
Editors’ Note:
E-journal archiving is receiving a lot of attention these days. We are pleased to present this editors’ interview with Victoria Reich, Director of the LOCKSS Program, who will describe recent work associated with the deployment of the LOCKSS (Lots of Copies Keep Stuff Safe) technology. Later this spring, we’ll offer an editors’ interview with Eileen Fenton, Executive Director of Portico, another e-journal archiving initiative.
First there was the LOCKSS system. You launched the LOCKSS Alliance in early 2005. In January 2006, you announced the CLOCKSS initiative. Would you provide a brief description of each and compare/contrast their contributions to community efforts in digital preservation?
The LOCKSS (Lots of Copies Keep Stuff Safe) Program was motivated by the emergence of important scholarly information on the Web and a basic belief that the core of a library is its collections. The LOCKSS software offers libraries a cost-effective and easy way to build digital collections of Web-based content. Digital information is extremely fragile and preservation must start from the moment it is put into circulation.
The LOCKSS Program has worked to introduce new best practices and reinforce others. Specifically, the LOCKSS Program holds that the following concepts are required for a robust digital preservation system:
- Replicate the content in independent repositories.
- Audit the bits and bytes. Digital content is fragile. Highly reliable off-line backup is very expensive. If files are continuously compared and damage automatically repaired, off-line backup can be eliminated.
- Don't touch! Minimize processing, migrate on demand, and preserve presentation and look and feel.
- Open source software is critical. Provide the community with a transparent mechanism to confirm processing claims. Guard against dependence on limited, centralized technical expertise.
- Allow no single points of failure. Strive for diversity in administration, funding, and technology.
- Have extremely cost-effective processes.
The LOCKSS system is format-agnostic and will collect and preserve content in any format that can be delivered over the Web (i.e., via HTTP), has a stable URL structure, and changes at a moderate pace. The system is transparent and preserved content is delivered to browsers exactly as the original publisher would have delivered it provided the browser will accept that format.
The single greatest threat to materials being preserved over the long term is money. Societies will have good times and bad. Keeping content safe must be a marginal expense in order to decrease the threats during bad times as well as to maximize available funds for new acquisitions during good times.
The LOCKSS Alliance, launched in 2005, is a library membership organization. The Alliance is governed by a library board [1] and advised by a publisher committee. Digital preservation is best done as a collaborative, community effort. This allows community control over the direction of the technology and applications. We’ve already seen the impact of this approach. Together the LOCKSS Alliance members have directed development of the system and are using the software to build local collections and preserve a wide variety of formats and genres. In addition to subscription and open access electronic journals, LOCKSS Alliance members are collecting and preserving government documents, electronic theses and dissertations, websites, in-house image collections, and, soon, books and blogs.
The community is demonstrating strong support for this approach. In 2005, through word of mouth and grassroots outreach, the LOCKSS Alliance garnered two-thirds of the community support needed to be self-sustaining. As the Alliance community grows, the fees for LOCKSS Alliance membership will decrease: “many hands make light work.”
The CLOCKSS (Controlled Lots of Copies Keep Stuff Safe) Initiative is designed to test the feasibility of a large, community-managed dark archive. The member librarians and publishers [2] are working together, as equals, to develop frameworks for publisher disaster failover systems and processes for providing public access to orphaned materials. CLOCKSS institutions range from 65 to 500 years old and have successfully met sustainability and survival challenges. Our work will be open and transparent, and there will be full public disclosure of CLOCKSS operations, governance, and technology. During this initial two-year project, CLOCKSS members will be working towards implementing a production system.
What do the terms “preservation” and “archive” mean in the LOCKSS Program context?
The fundamental goal of a digital preservation system is that the content stored in the system remains accessible to future readers for a time much longer than the lifetime of any individual component of the system. The National Research Council's study for the National Archives points out that “…the designers of a digital preservation system need a clear vision of the threats against which they are being asked to protect their system’s contents, and those threats under which it is acceptable for preservation to fail.” In the LOCKSS threat model, economic threats are a major focus. The technology is designed to minimize an institution’s total cost of ownership over the long haul and avoid front-loading the costs that do arise. For a comprehensive discussion of the threats to digital preservation see our recent D-Lib article.
Communities are using the LOCKSS digital preservation software to build archives. There are a number of examples, and among the best known are the NDIIPP-funded MetaArchive Project, the ASERL Electronic Thesis and Dissertations Project, and the CLOCKSS Initiative.
What kind of LOCKSS policies and procedures are there?
Our overarching policy is to keep procedures as simple as possible and still get the job done.
Here are some brief examples:
- Human intervention is expensive and prone to error. Automate!
- Secrets are dangerous. Aim for 100% transparency and auditability in everything.
- Legal frameworks change and paper trails are not reliable. Bundle and preserve legal rights and restrictions with the content to the greatest possible extent.
- Guard against technical arrogance. Build and use open source software and nurture a critical and contributing technical community. The LOCKSS software is fully documented and available on SourceForge.net. New updates are released about every six weeks. It’s easy to install a LOCKSS box.
- Collection development is a local activity.
- Building publisher relationships is a collaborative activity. The LOCKSS Alliance community is documenting practices for working with external content providers, such as electronic journal publishers and government entities (state and federal). Members are also documenting practices for working with providers of internal content, for example electronic theses and dissertations, manuscripts, and special collections.
- Our industries’ business relationships have worked fairly well for 500 years—endeavor to strengthen these, not to break them.
What is the nature of the relationships established with publishers? How does the LOCKSS manifest page work?
The LOCKSS Program encourages publishers to publish and libraries to take custody of the record of scholarship. The relationships between library and publisher are well-established and LOCKSS keeps this two-party relationship.
Many publishers are comfortable with LOCKSS Alliance member libraries exercising their traditional social obligation as long-term cultural memory organizations. The LOCKSS Alliance members are major US and UK research libraries and are governed by strong intellectual property practices.
Publishers have complete control over which libraries they permit to preserve, what specific content they can preserve, and when they take custody of that content. Publishers must give permission to libraries to preserve content on the LOCKSS system. They give permission by posting a “LOCKSS publisher manifest page” on their website for each “archival unit” to be preserved. In traditional journals, an archival unit is a volume. A LOCKSS manifest page provides:
- A set of links that the crawler will follow to find content to collect.
- A permission statement saying that the LOCKSS system has permission to collect and preserve the content found by following the links.
Publishers post the manifest page with the full text of the content behind the publisher’s access control wall. This ensures that only subscribers have permission to preserve the content. This permission is bundled with and preserved with the content. Furthermore, each LOCKSS box is clearly identified both by an “identification tag” and an IP address.
Through the manifest page mechanism, publishers give permission to authorized libraries to preserve their content. The use and access restrictions of subscription-based content are governed by the original license agreement. If a publisher’s terms and conditions are fairly stable across customer bases, we urge them to put rights and restrictions on the LOCKSS publisher manifest page so this information is bundled with and preserved with the content.
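As a rough illustration of the manifest page mechanism described above, the following Python sketch checks a page for a permission statement and gathers the links a crawler would follow. The permission wording, the sample HTML, and the names `PERMISSION_TEXT`, `ManifestParser`, and `parse_manifest` are assumptions made for this example; it is not the LOCKSS software's actual implementation.

```python
from html.parser import HTMLParser

# Assumed wording for illustration only; a real manifest page may phrase it differently.
PERMISSION_TEXT = (
    "LOCKSS system has permission to collect, preserve, and serve this Archival Unit"
)


class ManifestParser(HTMLParser):
    """Collects link targets and visible text from a manifest page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # Record the target of every anchor that has an href attribute.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        self.text_parts.append(data)


def parse_manifest(html):
    """Return (has_permission, links) for a manifest page given as a string."""
    parser = ManifestParser()
    parser.feed(html)
    parser.close()
    # Normalize whitespace so line breaks in the page do not hide the statement.
    text = " ".join(" ".join(parser.text_parts).split())
    return PERMISSION_TEXT in text, parser.links


# Hypothetical manifest page for one archival unit (one journal volume).
SAMPLE = """
<html><body>
<h1>Example Journal, Volume 12</h1>
<a href="/v12/issue1/">Issue 1</a>
<a href="/v12/issue2/">Issue 2</a>
<p>LOCKSS system has permission to collect, preserve,
and serve this Archival Unit</p>
</body></html>
"""

permitted, start_urls = parse_manifest(SAMPLE)
# A crawler would follow start_urls only when permitted is True.
```

In this sketch the permission check and the link list come from the same page, which mirrors the idea that permission travels with the content it covers.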
A manifest page is not always required. The LOCKSS system ingests content from websites that support OAI-PMH with permission but without the need for a manifest page full of links. For open access publishers, we encourage use of an appropriate Creative Commons license. This machine-readable license is then preserved with the content, making clear current and future rights. Also, in response to community requests, we are currently working on methods to preserve blogs. We expect that in most cases this will be done via RSS. Other ingest mechanisms may also be implemented, such as Google SiteMaps.
Are publishers’ permissions to preserve content revocable?
Once a publisher has given a library permission to preserve an archival unit and the library’s LOCKSS box is preserving that content, the publisher cannot take back the content. This is analogous to the library’s buying a paper journal and putting it on the shelf. The publisher doesn’t walk into the library and say, “I want my journal run back!” The publisher can, however, prevent further and future preservation activity by not putting new publisher manifest pages online and/or by removing manifest pages for back content.
What kind of content is captured from publisher websites and how is it stored?
A LOCKSS box collects everything that a reader can see on a publisher's website and stores it bit-for-bit as it is received. The LOCKSS system preserves the look and feel of a work, or the “performance” of a work. The publishing world is moving away from merely replicating printed look and feel for the Web environment. In the mid-1990s, major peer-reviewed science journals began to include supplemental materials (databases, movies, images), and media-rich humanities journals, such as Exquisite Corpse and Vectors, are now proliferating.
How does the polling process work?
Each LOCKSS box collects its content independently. Each box with a given collection then compares the content it collected with others that collected the same content through the polling process, which involves voting on the hash of the content. If they reach consensus, we treat the consensus as definitive; the boxes that were in the minority re-fetch content from the publisher and repeat the comparison. If after repeated attempts a box is still in the minority, the software sends an alert to its operator.
Since publishers do not (for good technical reasons) sign the Web pages they publish, if multiple versions of a page are collected from a publisher's website, there is no algorithmic way to know which is authentic; some human intervention is necessary. Note that each box stores every version of a page that it receives, although versions that don't reach consensus are regarded as suspect.
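The collect, compare, and re-fetch cycle described above can be sketched in a few lines of Python. This is a deliberately simplified model: the use of SHA-256, the simple majority rule, the limit of three attempts, and the names `run_poll` and `collect_with_polling` are all assumptions for illustration. The real LOCKSS polling protocol samples voters and includes defenses that are not modeled here.

```python
import hashlib
from collections import Counter


def content_hash(content: bytes) -> str:
    """Hash of one box's copy; SHA-256 is an assumption of this sketch."""
    return hashlib.sha256(content).hexdigest()


def run_poll(copies):
    """copies maps box name -> bytes. Returns (consensus_hash, minority_boxes).

    consensus_hash is None when no hash is held by a strict majority.
    """
    if not copies:
        raise ValueError("a poll needs at least one copy")
    votes = {box: content_hash(data) for box, data in copies.items()}
    winner, count = Counter(votes.values()).most_common(1)[0]
    if count * 2 <= len(votes):
        return None, sorted(votes)  # no majority: every box needs attention
    return winner, sorted(box for box, h in votes.items() if h != winner)


def collect_with_polling(copies, refetch, max_attempts=3):
    """Poll, let minority boxes re-fetch from the publisher, and poll again.

    Returns (consensus_hash, boxes_to_alert). Boxes still in the minority
    after max_attempts re-fetches are reported so their operators can be alerted.
    """
    for _ in range(max_attempts):
        consensus, minority = run_poll(copies)
        if consensus is None or not minority:
            return consensus, minority
        for box in minority:
            copies[box] = refetch(box)
    return run_poll(copies)


publisher_page = b"<html>Volume 12, article 3</html>"
copies = {
    "box-a": publisher_page,
    "box-b": publisher_page,
    "box-c": publisher_page,
    "box-d": b"<html>Volume 12, artic",  # a truncated fetch
}
consensus, alerts = collect_with_polling(copies, refetch=lambda box: publisher_page)
```

In this run the truncated copy on `box-d` loses the first poll, is re-fetched, and agrees on the second poll, so no operator alert is needed.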
How many LOCKSS boxes are needed for a reliable vote?
When building private LOCKSS networks, for example the MetaArchive project, we recommend a minimum of 6-7 replicas and expect all of them to take part in every poll. The production LOCKSS system has over 100 boxes, although not all of them have all possible content collections. We currently set system parameters so that a typical poll in this system has 7-10 votes.
How much redundancy is needed?
The question of redundancy is somewhat misleading in the LOCKSS context. Each library collects and preserves content for its own readers alone; it does not in general have permission to disseminate the content to unaffiliated readers. The LOCKSS system does not support file sharing, and this is a major reason why publishers are prepared to cooperate. Thus, the number of copies of given content in the LOCKSS system as a whole is a side effect of the individual collection development processes at the member libraries. It is not a global system management decision.
That being said, each library may set a threshold number of copies that it considers adequate for reliability. If that library’s box fails to find that many copies on other libraries’ LOCKSS boxes, it will alert the operator, who can then persuade fellow librarians to collect the content in question to aid in its survival.
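A per-library replica threshold of this kind might be checked roughly as follows. The data layout (a mapping from peer library to the archival units it holds), the unit names, and the function name `under_replicated` are hypothetical; in practice a box would learn about other copies through polling rather than from a shared table.

```python
def under_replicated(my_units, peer_holdings, min_other_copies):
    """Return (unit, copies_found) for each unit with too few copies elsewhere.

    my_units: archival units this library preserves.
    peer_holdings: hypothetical mapping of peer library -> set of units it holds.
    min_other_copies: the local threshold this library considers adequate.
    """
    alerts = []
    for unit in sorted(my_units):
        others = sum(1 for units in peer_holdings.values() if unit in units)
        if others < min_other_copies:
            alerts.append((unit, others))
    return alerts


my_units = {"journal-x-v12", "etd-2005"}
peer_holdings = {
    "lib-1": {"journal-x-v12"},
    "lib-2": {"journal-x-v12", "etd-2005"},
    "lib-3": {"journal-x-v12"},
}
replica_alerts = under_replicated(my_units, peer_holdings, min_other_copies=3)
# Only "etd-2005" falls short: one other copy was found against a threshold of three.
```

The threshold is a local setting, which matches the point that the number of copies is not a global system management decision.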
The LOCKSS system has been criticized for excessive redundancy. It is true that if your goal is to use disk resources as efficiently as possible, minimizing the number of replicas consistent with reliability would be a sensible policy. Someone who believes the cost of disk storage is the major cost in the system might choose such a policy. Fortunately, disk storage is among the smallest costs in the system. The LOCKSS system is carefully designed to minimize the use of expensive resources, such as a lawyer’s billable hours and attention from skilled system administrators, by using other, cheaper resources, such as disks, lavishly.
How does the LOCKSS system ensure, over long periods, that the content in the system is the same as when it was originally collected from the publisher?
Once consensus has been achieved through the polling process and the canonical version from the publisher has been determined, LOCKSS boxes compare their content regularly with others. There are two possible reasons for disagreement at this stage (after initial consensus has been achieved):
- Random damage to an individual LOCKSS machine’s content
- A malicious attempt to alter preserved content
The probability that a significant number of LOCKSS boxes would suffer identical damage in the period between audits is extremely small. Thus we distinguish between “incoherent,” or random, damage and “coherent,” or malicious, damage. Incoherent damage is detected and repaired automatically since the damaged box will lose a poll by a landslide. The techniques for detecting and defeating attempts at coherent damage are complex and are described in an award-winning paper presented at ACM's 2003 SOSP workshop.
Briefly, it is extremely improbable that an attacker would know enough about the state of the system to alter the preserved content without being detected. Note that because the LOCKSS boxes are independently administered, these defenses work even against attacks by “insiders,” such as authorized administrators of individual LOCKSS boxes. Insiders are a leading cause of computer abuse incidents and should be a part of every digital preservation system’s threat model.
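The distinction between a landslide result and anything less clear-cut can be illustrated with a small classifier. The 80% landslide threshold, the outcome labels, and the choice to treat an inconclusive poll as a reason to alert a human are assumptions of this sketch; the actual defenses against coherent damage are considerably more involved.

```python
def classify_poll(agree, disagree, landslide=0.8):
    """Classify one box's view of a poll on its own copy.

    agree / disagree: number of sampled boxes whose hash matches or differs.
    landslide: assumed fraction of votes that counts as a landslide.
    """
    total = agree + disagree
    if total == 0:
        raise ValueError("a poll needs at least one vote")
    if agree / total >= landslide:
        return "intact"   # landslide agreement: the local copy is good
    if disagree / total >= landslide:
        return "repair"   # landslide loss: random damage, fetch a repair
    return "alarm"        # no landslide either way: ask a human to look


outcomes = [classify_poll(9, 1), classify_poll(1, 9), classify_poll(5, 5)]
```

Random damage to a single box produces the landslide loss in the second case, while a closely split vote like the third is the pattern that independent random damage is very unlikely to produce.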
How can defective (failing media, under attack, insecure) LOCKSS boxes be detected?
Each LOCKSS box treats all other individual LOCKSS boxes with total suspicion. LOCKSS boxes take actions only on the basis of a consensus of a sample of the other boxes. The integrity of the system does not depend on identifying faulty or subverted boxes and taking immediate action to exclude them. See SOSP 2003 for further details.
Have you been able to determine the causes of files needing repair?
We have detected failures at many levels of the system, for example ingest, disk storage, and the machines themselves. For more details on the reliability of digital storage, see our paper to appear at Eurosys 2006.
All components of an information system are unreliable in the timescales needed for digital preservation. It must be a fundamental design principle of any digital preservation system that it be capable of tolerating failures of any of its components. Thus, it would not be a criticism of the LOCKSS system, or any other digital preservation system, that it had not yet detected and repaired any errors. The fault tolerance mechanisms are necessary ab initio; they do not need to be justified by experience over a limited time span. Rather, it is the absence of fault tolerance mechanisms that is a criticism.
Would you give us some examples of technical developments in the works?
LOCKSS boxes currently communicate using a protocol we call LCAP V1, which is essentially unchanged since the early LOCKSS prototype. During 2006, we will be gradually phasing out V1 and replacing it with a new protocol, LCAP V3, which is now being tested. The design of V3 is the result of a four-year effort funded in part by a major grant from the National Science Foundation and involving more than a dozen researchers from the labs of Sun Microsystems, Intel, Hewlett-Packard, and the computer science departments of Stanford and Harvard. This team has presented major papers at ACM’s SOSP workshop, in ACM's Transactions on Computer Systems, at the Usenix conference, at the upcoming Eurosys conference, as well as many workshop papers.
The LOCKSS system currently ingests content via Web crawling and OAI. The community has asked us to expand this to include RSS so a wider variety of materials (blogs and other more rapidly evolving content) can also be collected and preserved.
A third initiative is to provide usage and content delivery statistics to LOCKSS box administrators. Both publishers and librarians are interested in how often a user is unable to reach the publisher’s server.
What components of OAIS does the LOCKSS system address? Does the LOCKSS system address quality control? What metadata goes along with digital objects in the LOCKSS system?
The LOCKSS system complies fully with the requirements of the OAIS standard. See the formal compliance statement on the LOCKSS website for details.
The LOCKSS system has automated audit mechanisms to ensure that (a) what is preserved is the consensus of many independent collections from the publisher, (b) what is available now still represents the consensus of many independent replicas, and (c) any deviations from these quality standards are reported to the operators of the LOCKSS boxes concerned.
The LOCKSS system collects all metadata available from the publisher. We are working towards integrating automated metadata extraction technology, both for format metadata (JHOVE) and bibliographic metadata (perhaps using technology from Rexa). As MacKenzie Smith has pointed out, human-generated metadata is so costly that dependence on it can be an Achilles Heel of digital preservation. The LOCKSS system is designed to be as cheap as possible for libraries to operate and, therefore, uses automated processes for all normal operations. Human intervention is a last resort.
What happens to current content if the LOCKSS Program ceases to exist (e.g., succession planning for trusted digital repositories)?
Each library owns and operates its own LOCKSS box. If at any time they want to switch to an alternate system for digital preservation, they have the technical capability to extract the entire content of their box in the exact form that it was originally collected. The content can then be submitted to an alternate system. Whether they have legal permission to do so would depend on the copyright holder.
What are the costs to institutions for using a LOCKSS box? What resources are required (e.g., time, programming) for working with publishers, adding a title, maintaining the LOCKSS box, etc.? What are your costs for maintaining and developing the current system?
Institutions are welcome to download the free, open source software and install it on whatever computer is appropriate for their particular application. The cost of working with publishers is directly proportional to the size of the publisher. Bringing a LOCKSS box online involves downloading the software, burning a CD, and running an installation script that takes about 8.5 minutes. There is also the cost of the LOCKSS box itself; library LOCKSS boxes are ordinarily PCs. We urge libraries to join the LOCKSS Alliance so they can easily and cost-effectively have a direct hand in building and preserving digital collections for future readers.
By policy, the costs for the LOCKSS team at Stanford are not growing. As mentioned earlier, it is extremely important for a digital preservation system to eliminate any central points of failure. A particularly expensive and vulnerable point of failure is technical expertise. As the system grows, we are growing an international, open source community.
How scalable are the LOCKSS boxes? How many objects can one LOCKSS box handle before audit checking overwhelms it?
A LOCKSS box needs a balance between computational bandwidth and storage capacity. We are presently evaluating state-of-the-art, low-cost storage boxes from Capricorn Technology that cost about $3500 for 2TB of storage. Although they are far less powerful than typical desktop or even laptop PCs, they appear to have adequate computational bandwidth to handle 2TB of content. More powerful and expensive hardware could handle more content per node, but at a higher cost/byte.
We have simulated systems of up to 1000 LOCKSS boxes. This is significantly larger than any real world system.
How are you dealing with format obsolescence?
We have implemented the necessary framework for format migration and demonstrated that it works. Our next steps are to provide an API to which format-converting plugins can be written and to create a registry of converter implementations.
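One way a converter registry and on-demand migration could fit together is sketched below. The source does not specify the plugin API, so everything here is hypothetical: the function names `register_converter` and `deliver`, the registry layout, and the toy legacy format.

```python
from typing import Callable, Dict, Set, Tuple

Converter = Callable[[bytes], bytes]

# Hypothetical registry: source media type -> (target media type, converter).
_REGISTRY: Dict[str, Tuple[str, Converter]] = {}


def register_converter(source_type: str, target_type: str, convert: Converter) -> None:
    """Register a plugin that converts source_type content into target_type."""
    _REGISTRY[source_type] = (target_type, convert)


def deliver(content: bytes, content_type: str, accepted: Set[str]) -> Tuple[bytes, str]:
    """Serve stored content, converting only if the browser cannot accept it.

    The stored bytes are never modified; conversion yields a temporary copy.
    """
    if content_type in accepted:
        return content, content_type
    entry = _REGISTRY.get(content_type)
    if entry is not None and entry[0] in accepted:
        target_type, convert = entry
        return convert(content), target_type
    raise LookupError(f"no converter from {content_type} to an accepted format")


# Toy plugin: a made-up legacy text format stored as Latin-1, served as UTF-8.
register_converter(
    "application/x-legacy-text",
    "text/plain",
    lambda data: data.decode("latin-1").encode("utf-8"),
)

stored = b"caf\xe9"
served, served_type = deliver(stored, "application/x-legacy-text", {"text/plain"})
```

Because conversion happens at delivery time, the preserved original stays bit-for-bit as collected, and a better converter registered later applies to old content as well.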
How are you addressing obsolescence of the LOCKSS box itself?
The LOCKSS software comes in two parts. The LOCKSS daemon is written in Java and requires only a standard Java virtual machine to run. We routinely run it on OpenBSD, Linux, and MacOS X. It can also run on Windows and FreeBSD. Given the huge volume of mission-critical software with similar requirements, the commitment of major vendors such as Sun Microsystems and IBM, and the availability of open-source JVM implementations, obsolescence of Java is not an immediate concern. Nevertheless, the LOCKSS team has a medium-term goal of persuading some other team(s) to write independent implementations of the LOCKSS protocol. Doing so would improve the reliability of the system even if Java were never to become obsolete.
The LOCKSS platform requires a generic PC and OpenBSD. The economics of technical markets make early obsolescence of the generic PC not an immediate concern. OpenBSD has an established history, but we routinely use other operating systems and if it became necessary could easily switch the entire system to use another operating system.
What kinds of initiatives are institutions interested in developing?
The library community is interested in collecting and preserving a wide variety of digital materials with a tool they can develop and manipulate to serve their local communities cost-effectively over the very long term.
During 2005 the following kinds of materials were collected and preserved on either the international or local LOCKSS system networks: special collections (including in-house digitized collections), websites, institutionally published materials, government documents (state, federal), and electronic theses and dissertations. Major projects that are using LOCKSS technology to preserve materials other than electronic journals are:
What does the future look like for the LOCKSS Program?
Very bright and extremely busy!
What will success look like for the LOCKSS Program, the LOCKSS Alliance, and the CLOCKSS Initiative?
However long you watch a digital preservation system, you can never be sure it will provide access when it is needed in the future. The LOCKSS Program, the LOCKSS Alliance, and the CLOCKSS Initiative are working to increase the odds that future readers will find today’s content when they need it.
Notes
1. LOCKSS Alliance Board: Carol Pitts Diedrichs, Dean of Libraries, William T. Young Endowed Chair, University of Kentucky Libraries; Nancy L. Eaton, Dean of University Libraries, The Pennsylvania State University; David S. Ferriero, Andrew W. Mellon Director and Chief Executive of the Research Libraries, New York Public Library; Brinley Franklin, Vice Provost for University Libraries, University of Connecticut; Michael A. Keller, Ida M. Green University Librarian, Director of Academic Information Resources, Publisher of HighWire Press, Publisher of Stanford University Press, Stanford University; Susan K. Nutter, Vice Provost and Director of Libraries, North Carolina State University; Ann Okerson, Associate University Librarian, Collections & International Programs, Yale University; Carton Rogers, Vice Provost and Director of Libraries, University of Pennsylvania
2. CLOCKSS Initiative Members:
- Publishers - American Medical Association, American Physiological Society, Blackwell, Nature Publishing Group, OUP, SAGE Publications, Springer, Taylor and Francis, John Wiley & Sons, Inc. In addition, Elsevier is participating in all discussions and is sharing in financial support.
- Libraries - Edinburgh University, Indiana University, New York Public Library, Rice University, Stanford University, University of Virginia
