Editors' Interview
 Editors' Interview with Victoria Reich, Director, LOCKSS Program

Editors’ Note: E-journal archiving is receiving a lot of attention these days. We are pleased to present this editors’ interview with Victoria Reich, Director of the LOCKSS Program, who will describe recent work associated with the deployment of the LOCKSS (Lots of Copies Keep Stuff Safe) technology. Later this spring, we’ll offer an editors’ interview with Eileen Fenton, Executive Director of Portico, another e-journal archiving initiative.
First there was the LOCKSS system. You launched the LOCKSS Alliance in early 2005. In January 2006, you announced the CLOCKSS initiative. Would you provide a brief description of each and compare/contrast their contributions to community efforts in digital preservation?
The LOCKSS (Lots of Copies Keep Stuff Safe) Program was motivated by the emergence of important scholarly information on the Web and a basic belief that the core of a library is its collections. The LOCKSS software offers libraries a cost-effective and easy way to build digital collections of Web-based content. Digital information is extremely fragile and preservation must start from the moment it is put into circulation.
The LOCKSS Program has worked to introduce new best practices and reinforce others. Specifically, the LOCKSS Program holds that the following concepts are required for a robust digital preservation system:
- Replicate the content in independent repositories.
- Audit the bits and bytes. Digital content is fragile. Highly reliable off-line backup is very expensive. If files are continuously compared and damage automatically repaired, off-line backup can be eliminated.
- Don't touch! Minimize processing, migrate on demand, and preserve presentation and look and feel.
- Open source software is critical. Provide the community with a transparent mechanism to confirm processing claims. Guard against dependence on limited, centralized technical expertise.
- Allow no single points of failure. Strive for diversity in administration, funding, and technology.
- Have extremely cost-effective processes.
The LOCKSS system is format-agnostic and will collect and preserve content in any format that can be delivered over the Web (i.e., via HTTP), has a stable URL structure, and changes at a moderate pace. The system is transparent and preserved content is delivered to browsers exactly as the original publisher would have delivered it provided the browser will accept that format.
The single greatest threat to materials being preserved over the long term is money. Societies will have good times and bad. Keeping content safe must be a marginal expense in order to decrease the threats during bad times as well as to maximize available funds for new acquisitions during good times.
The LOCKSS Alliance, launched in 2005, is a library membership organization. The Alliance is governed by a library board [1] and advised by a publisher committee. Digital preservation is best done as a collaborative, community effort. This allows community control over the direction of the technology and applications. We’ve already seen the impact of this approach. Together the LOCKSS Alliance members have directed development of the system and are using the software to build local collections and preserve a wide variety of formats and genres. In addition to subscription and open access electronic journals, LOCKSS Alliance members are collecting and preserving government documents, electronic theses and dissertations, websites, in-house image collections, and, soon, books and blogs.
The community is demonstrating strong support for this approach. In 2005, through word of mouth and grassroots outreach, the LOCKSS Alliance garnered two-thirds of the community support needed to be self-sustaining. As the Alliance community grows, the fees for LOCKSS Alliance membership will decrease: “many hands make light work.”
The CLOCKSS (Controlled Lots of Copies Keep Stuff Safe) Initiative is designed to test the feasibility of a large, community-managed dark archive. The member librarians and publishers [2] are working together, as equals, to develop frameworks for publisher disaster failover systems and processes for providing public access to orphaned materials. CLOCKSS institutions range from 65 to 500 years old and have successfully met sustainability and survival challenges. Our work will be open and transparent, and there will be full public disclosure of CLOCKSS operations, governance, and technology. During this initial two-year project, CLOCKSS members will be working towards implementing a production system.
What do the terms “preservation” and “archive” mean in the LOCKSS Program context?
The fundamental goal of a digital preservation system is that the content stored in the system remains accessible to future readers for a time much longer than the lifetime of any individual component of the system. The National Research Council's study for the National Archives points out that “…the designers of a digital preservation system need a clear vision of the threats against which they are being asked to protect their system’s contents, and those threats under which it is acceptable for preservation to fail.” In the LOCKSS threat model, economic threats are a major focus. The technology is designed to minimize an institution’s total cost of ownership over the long haul and avoid front-loading the costs that do arise. For a comprehensive discussion of the threats to digital preservation see our recent D-Lib article.
Communities are using the LOCKSS digital preservation software to build archives. There are a number of examples, and among the best known are the NDIIPP-funded MetaArchive Project, the ASERL Electronic Thesis and Dissertations Project, and the CLOCKSS Initiative.
What kind of LOCKSS policies and procedures are there?
Our overarching policy is to keep procedures as simple as possible and still get the job done. Here are some brief examples:
- Human intervention is expensive and prone to error. Automate!
- Secrets are dangerous. Aim for 100% transparency and auditability in everything.
- Legal frameworks change and paper trails are not reliable. Bundle and preserve legal rights and restrictions with the content to the greatest possible extent.
- Guard against technical arrogance. Build and use open source software and nurture a critical and contributing technical community. The LOCKSS software is fully documented and available on SourceForge.net. New updates are released about every six weeks. It’s easy to install a LOCKSS box.
- Collection development is a local activity.
- Building publisher relationships is a collaborative activity. The LOCKSS Alliance community is documenting practices for working with external content providers, such as electronic journal publishers and government entities (state and federal). Members are also documenting practices for working with providers of internal content, for example electronic theses and dissertations, manuscripts, and special collections.
- Our industries’ business relationships have worked fairly well for 500 years—endeavor to strengthen these, not to break them.
What is the nature of the relationships established with publishers? How does the LOCKSS manifest page work?
The LOCKSS Program encourages publishers to publish and libraries to take custody of the record of scholarship. The relationships between library and publisher are well-established and LOCKSS keeps this two-party relationship.
Many publishers are comfortable with LOCKSS Alliance member libraries exercising their traditional social obligation as long-term cultural memory organizations. The LOCKSS Alliance members are major US and UK research libraries and are governed by strong intellectual property practices.
Publishers have complete control over which libraries may preserve their content, what specific content those libraries may preserve, and when the libraries take custody of that content. Publishers must give permission to libraries to preserve content on the LOCKSS system. They give permission by posting a “LOCKSS publisher manifest page” on their website for each “archival unit” to be preserved. In traditional journals, an archival unit is a volume. A LOCKSS manifest page provides:
- A set of links that the crawler will follow to find content to collect.
- A permission statement saying that the LOCKSS system has permission to collect and preserve the content found by following the links.
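The two parts of a manifest page can be illustrated with a minimal sketch, assuming a simple HTML page. The permission wording in `PERMISSION_TEXT`, the sample page, and the names `ManifestParser` and `read_manifest` are assumptions made for illustration; the interview does not specify them, and the real crawler is more involved.

```python
from html.parser import HTMLParser

# Assumed wording of the permission statement; the exact text a publisher
# posts is not given in the interview.
PERMISSION_TEXT = (
    "LOCKSS system has permission to collect, preserve, and serve "
    "this Archival Unit"
)


class ManifestParser(HTMLParser):
    """Collect link targets and visible text from a manifest page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # Every <a href="..."> is a starting point for the crawler.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.text_parts.append(data)


def read_manifest(html):
    """Return (has_permission, links) for one manifest page."""
    parser = ManifestParser()
    parser.feed(html)
    parser.close()
    # Normalize whitespace so line breaks in the page do not hide the statement.
    text = " ".join(" ".join(parser.text_parts).split())
    return PERMISSION_TEXT in text, parser.links


# Hypothetical manifest page for one archival unit (a journal volume).
SAMPLE = """
<html><body>
<h1>Example Journal, Volume 12</h1>
<ul>
  <li><a href="/v12/issue1/">Issue 1</a></li>
  <li><a href="/v12/issue2/">Issue 2</a></li>
</ul>
<p>LOCKSS system has permission to collect, preserve, and serve this Archival Unit</p>
</body></html>
"""

has_permission, links = read_manifest(SAMPLE)
print(has_permission, links)
```

In this sketch a crawler would follow `links` only when `has_permission` is true, which mirrors the rule that content is collected only where the publisher has posted permission.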
Publishers post the manifest page with the full text of the content behind the publisher’s access control wall. This ensures that only subscribers have permission to preserve the content. This permission is bundled with and preserved with the content. Furthermore, each LOCKSS box is clearly identified both by an “identification tag” and an IP address.
Through the manifest page mechanism, publishers give permission to authorized libraries to preserve their content. The use and access restrictions of subscription-based content are governed by the original license agreement. If a publisher’s terms and conditions are fairly stable across customer bases, we urge them to put rights and restrictions on the LOCKSS publisher manifest page so this information is bundled with and preserved with the content.
A manifest page is not always required. The LOCKSS system ingests content from websites that support OAI-PMH with permission but without the need for a manifest page full of links. For open access publishers, we encourage use of an appropriate Creative Commons license. This machine-readable license is then preserved with the content, making clear current and future rights. Also, in response to community requests, we are currently working on methods to preserve blogs. We expect that in most cases this will be done via RSS. Other ingest mechanisms may also be implemented, such as Google SiteMaps.
Are publishers’ permissions to preserve content revocable?
Once a publisher has given a library permission to preserve an archival unit and the library’s LOCKSS box is preserving that content, the publisher cannot take back the content. This is analogous to the library’s buying a paper journal and putting it on the shelf. The publisher doesn’t walk into the library and say, “I want back my journal run!” The publisher can, however, prevent further and future preservation activity by not putting online new publisher manifest pages and/or by removing manifest pages for back content.
What kind of content is captured from publisher websites and how is it stored?
A LOCKSS box collects everything that a reader can see on a publisher's website and stores it bit-for-bit as it is received. The LOCKSS system preserves the look and feel of a work, or the “performance” of a work. The publishing world is moving away from merely replicating printed look and feel for the Web environment. In the mid-1990s, major peer-reviewed science journals began to include supplemental materials (databases, movies, images), and media-rich humanities journals, such as Exquisite Corpse and Vectors, are now proliferating.
How does the polling process work?
Each LOCKSS box collects its content independently. Each box with a given collection then compares the content it collected with others that collected the same content through the polling process, which involves voting on the hash of the content. If they reach consensus, we treat the consensus as definitive; the boxes that were in the minority re-fetch content from the publisher and repeat the comparison. If after repeated attempts a box is still in the minority, the software sends an alert to its operator.
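The comparison described above can be sketched as a simple majority vote over content hashes. This is an illustration only, not the LCAP protocol: the choice of SHA-256, the strict-majority rule, the retry limit, and the names `run_poll`, `audit`, and `refetch` are all assumptions.

```python
import hashlib
from collections import Counter


def content_hash(content):
    """Hash one box's copy of the content (SHA-256 is an assumption)."""
    return hashlib.sha256(content).hexdigest()


def run_poll(copies):
    """Tally one vote per box.

    Returns (consensus_hash, minority_boxes). consensus_hash is None when
    no hash is held by a strict majority of the boxes.
    """
    votes = {box: content_hash(data) for box, data in copies.items()}
    top_hash, top_count = Counter(votes.values()).most_common(1)[0]
    if top_count * 2 <= len(votes):
        return None, []
    minority = sorted(box for box, h in votes.items() if h != top_hash)
    return top_hash, minority


def audit(copies, refetch, max_attempts=3):
    """Poll repeatedly, letting minority boxes re-fetch from the publisher.

    Returns the boxes whose operators should be alerted.
    """
    copies = dict(copies)  # do not modify the caller's mapping
    for _ in range(max_attempts):
        consensus, minority = run_poll(copies)
        if consensus is None:
            # No consensus at all: assumed here to need human attention.
            return sorted(copies)
        if not minority:
            return []
        for box in minority:
            copies[box] = refetch(box)
    return run_poll(copies)[1]


# Four hypothetical boxes; one collected a damaged copy.
copies = {
    "box-a": b"article v1",
    "box-b": b"article v1",
    "box-c": b"article v1",
    "box-d": b"article v1 (damaged)",
}
print(audit(copies, refetch=lambda box: b"article v1"))  # re-fetch succeeds
print(audit(copies, refetch=lambda box: b"garbled"))     # re-fetch keeps failing
```

The second call shows the alert case: a box that is still in the minority after repeated attempts is reported to its operator rather than silently retried forever.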
Since publishers do not (for good technical reasons) sign the Web pages they publish, if multiple versions of a page are collected from a publisher's website, there is no algorithmic way to know which is authentic; some human intervention is necessary. Note that each box stores every version of a page that it receives, although versions that don't reach consensus are regarded as suspect.
How many LOCKSS boxes are needed for a reliable vote?
When building private LOCKSS networks, for example the MetaArchive project, we recommend a minimum of 6-7 replicas and expect all of them to take part in every poll. The production LOCKSS system has over 100 boxes, although not all of them have all possible content collections. We currently set system parameters so that a typical poll in this system has 7-10 votes.
How much redundancy is needed?
The question of redundancy is somewhat misleading in the LOCKSS context. Each library collects and preserves content for its own readers alone; it does not in general have permission to disseminate the content to unaffiliated readers. The LOCKSS system does not support file sharing, and this is a major reason why publishers are prepared to cooperate. Thus, the number of copies of given content in the LOCKSS system as a whole is a side effect of the individual collection development processes at the member libraries. It is not a global system management decision.
That being said, each library may set a threshold number of copies that it considers adequate for reliability. If that library’s box fails to find many other copies existing on other libraries’ LOCKSS boxes, it will alert the operator to persuade fellow librarians to collect the content in question to aid in its survival.
The LOCKSS system has been criticized for excessive redundancy. It is true that if your goal is to use disk resources as efficiently as possible, minimizing the number of replicas consistent with reliability would be a sensible policy. Someone who believes the cost of disk storage is the major cost in the system might choose such a policy. Fortunately, disk storage is among the least expensive resources in the system. The LOCKSS system is carefully designed to minimize the use of expensive resources, such as a lawyer’s billable hours and attention from skilled system administrators, by using other, cheaper resources, such as disks, lavishly.
How does the LOCKSS system ensure, over long periods, that the content in the system is the same as when it was originally collected from the publisher?
Once consensus has been achieved through the polling process and the canonical version from the publisher has been determined, LOCKSS boxes compare their content regularly with others. There are two possible reasons for disagreement at this stage (after initial consensus has been achieved):
- Random damage to an individual LOCKSS machine’s content
- A malicious attempt to alter preserved content
The probability that a significant number of LOCKSS boxes would suffer identical damage in the period between audits is extremely small. Thus we distinguish between “incoherent,” or random, damage and “coherent,” or malicious, damage. Incoherent damage is detected and repaired automatically since the damaged box will lose a poll by a landslide. The techniques for detecting and defeating attempts at coherent damage are complex and are described in an award-winning paper presented at ACM's 2003 SOSP symposium.
Briefly, it is extremely improbable that an attacker would know enough about the state of the system to alter the preserved content without being detected. Note that because the LOCKSS boxes are independently administered, these defenses work even against attacks by “insiders,” such as authorized administrators of individual LOCKSS boxes. Insiders are a leading cause of computer abuse incidents and should be a part of every digital preservation system’s threat model.
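The "landslide" idea can be made concrete with a small sketch, seen from one box's point of view. The 80% threshold, the outcome labels, and the treatment of close polls as "inconclusive" are assumptions for illustration; the actual defenses are those described in the SOSP paper.

```python
def classify_poll(agree, disagree, landslide=0.8):
    """Classify a poll result for one box.

    agree    -- votes matching this box's copy
    disagree -- votes not matching it
    landslide -- assumed fraction of votes that counts as a landslide
    """
    total = agree + disagree
    if total == 0:
        raise ValueError("a poll needs at least one vote")
    if agree / total >= landslide:
        return "intact"        # this box's copy matches the consensus
    if disagree / total >= landslide:
        return "repair"        # lost by a landslide: random damage, repair it
    return "inconclusive"      # close result: not expected from random damage


print(classify_poll(9, 1))
print(classify_poll(1, 9))
print(classify_poll(5, 5))
```

Random damage to a single box should produce only the first two outcomes, so a close result is the kind of signal that would warrant closer scrutiny in this sketch.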
How can defective (failing media, under attack, insecure) LOCKSS boxes be detected?
Each LOCKSS box treats all other individual LOCKSS boxes with total suspicion. LOCKSS boxes take actions only on the basis of a consensus of a sample of the other boxes. The integrity of the system does not depend on identifying faulty or subverted boxes and taking immediate action to exclude them. See SOSP 2003 for further details.
Have you been able to determine the causes of files needing repair?
We have detected failures at many levels of the system, for example ingest, disk storage, and the machines themselves. For more details on the reliability of digital storage, see our paper to appear at Eurosys 2006.
All components of an information system are unreliable in the timescales needed for digital preservation. It must be a fundamental design principle of any digital preservation system that it be capable of tolerating failures of any of its components. Thus, it would not be a criticism of the LOCKSS system, or any other digital preservation system, that it had not yet detected and repaired any errors. The fault tolerance mechanisms are necessary ab initio; they do not need to be justified by experience over a limited time span. Rather, it is the absence of fault tolerance mechanisms that is a criticism.
Would you give us some examples of technical developments in the works?
LOCKSS boxes currently communicate using a protocol we call LCAP V1, which is essentially unchanged since the early LOCKSS prototype. During 2006, we will be gradually phasing out V1 and replacing it with a new protocol, LCAP V3, which is now being tested. The design of V3 is the result of a four-year effort funded in part by a major grant from the National Science Foundation and involving more than a dozen researchers from the labs of Sun Microsystems, Intel, Hewlett-Packard, and the computer science departments of Stanford and Harvard. This team has presented major papers at ACM’s SOSP symposium, in ACM's Transactions on Computer Systems, and at the Usenix conference, with another to appear at the upcoming Eurosys conference, and has also presented many workshop papers.
The LOCKSS system currently ingests content via Web crawling and OAI. The community has asked us to expand this to include RSS so a wider variety of materials (blogs and other more rapidly evolving content) can also be collected and preserved.
A third initiative is to provide usage and content delivery statistics to LOCKSS box administrators. Both publishers and librarians are interested in how often a user is unable to reach the publisher’s server.
What components of OAIS does the LOCKSS system address? Does the LOCKSS system address quality control? What metadata goes along with digital objects in the LOCKSS system?
The LOCKSS system complies fully with the requirements of the OAIS standard. See the formal compliance statement on the LOCKSS website for details.
The LOCKSS system has automated audit mechanisms to ensure that (a) what is preserved is the consensus of many independent collections from the publisher, (b) what is available now still represents the consensus of many independent replicas, and (c) any deviations from these quality standards are reported to the operators of the LOCKSS boxes concerned.
The LOCKSS system collects all metadata available from the publisher. We are working towards integrating automated metadata extraction technology, both for format metadata (JHOVE) and bibliographic metadata (perhaps using technology from Rexa). As MacKenzie Smith has pointed out, human-generated metadata is so costly that dependence on it can be an Achilles Heel of digital preservation. The LOCKSS system is designed to be as cheap as possible for libraries to operate and, therefore, uses automated processes for all normal operations. Human intervention is a last resort.
What happens to current content if the LOCKSS Program ceases to exist (e.g., succession planning for trusted digital repositories)?
Each library owns and operates its own LOCKSS box. If at any time they want to switch to an alternate system for digital preservation, they have the technical capability to extract the entire content of their box in the exact form that it was originally collected. The content can then be submitted to an alternate system. Whether they have legal permission to do so would depend on the copyright holder.
What are the costs to institutions for using a LOCKSS box? What resources are required (e.g., time, programming) for working with publishers, adding a title, maintaining the LOCKSS box, etc.? What are your costs for maintaining and developing the current system?
Institutions are welcome to download the free, open source software and install it on whatever computer is appropriate for their particular application. The cost of working with publishers is directly proportional to the size of the publisher. Bringing a LOCKSS box online involves downloading the software, burning a CD, and running an installation script that takes about 8.5 minutes. There is the cost of the LOCKSS box; library LOCKSS boxes are ordinarily PCs. We urge libraries to join the LOCKSS Alliance so they can easily and cost-effectively have a direct hand in building and preserving digital collections for future readers.
By policy, the costs for the LOCKSS team at Stanford are not growing. As mentioned earlier, it is extremely important for a digital preservation system to eliminate any central points of failure. A particularly expensive and vulnerable point of failure is technical expertise. As the system grows, we are growing an international, open source community.
How scalable are the LOCKSS boxes? How many objects can one LOCKSS box handle before audit checking overwhelms it?
A LOCKSS box needs a balance between computational bandwidth and storage capacity. We are presently evaluating state-of-the-art, low-cost storage boxes from Capricorn Technology that cost about $3500 for 2TB of storage. Although they are far less powerful than typical desktop or even laptop PCs, they appear to have adequate computational bandwidth to handle 2TB of content. More powerful and expensive hardware could handle more content per node, but at a higher cost/byte.
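As a quick worked figure, the quoted price can be turned into a unit cost. The only assumption here is the decimal convention of 1 TB = 1000 GB.

```python
price_usd = 3500      # quoted price of one storage box
capacity_tb = 2       # quoted capacity

cost_per_tb = price_usd / capacity_tb
cost_per_gb = cost_per_tb / 1000   # assuming 1 TB = 1000 GB

print(f"${cost_per_tb:,.0f} per TB, ${cost_per_gb:.2f} per GB")
```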
We have simulated systems of up to 1000 LOCKSS boxes. This is significantly larger than any real world system.
How are you dealing with format obsolescence?
We have currently implemented the necessary framework for format migration and demonstrated that it works. Our next steps are to provide an API to which format converting plugins can be written and to create a registry of converter implementations.
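One way to picture a converter API and registry is sketched below. It is purely hypothetical and is not the LOCKSS plugin API: the names `register` and `deliver`, the MIME types, and the toy converter are all invented for illustration. It follows the "migrate on demand" principle stated earlier, converting only at delivery time and leaving the stored bits untouched.

```python
from typing import Callable, Dict, List, Tuple

Converter = Callable[[bytes], bytes]

# Hypothetical registry: (source MIME type, target MIME type) -> converter.
_REGISTRY: Dict[Tuple[str, str], Converter] = {}


def register(source: str, target: str, converter: Converter) -> None:
    """Add a converter implementation to the registry."""
    _REGISTRY[(source, target)] = converter


def deliver(content: bytes, mime: str, accepted: List[str]) -> Tuple[bytes, str]:
    """Serve content as stored if the reader accepts its format.

    Otherwise convert a copy on demand; the preserved original is not modified.
    """
    if mime in accepted:
        return content, mime
    for target in accepted:
        converter = _REGISTRY.get((mime, target))
        if converter is not None:
            return converter(content), target
    raise LookupError(f"no converter from {mime} to any of {accepted}")


# Toy converter: a made-up legacy format holding Latin-1 text, served as UTF-8.
register(
    "application/x-legacy",
    "text/plain",
    lambda data: data.decode("latin-1").encode("utf-8"),
)

stored = b"caf\xe9"
print(deliver(stored, "application/x-legacy", ["text/plain"]))
print(deliver(stored, "application/x-legacy", ["application/x-legacy"]))
```

Keeping converters in a registry keyed by format pair means a new converter can be added without touching the preserved content or the delivery logic.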
How are you addressing obsolescence of the LOCKSS box itself?
The LOCKSS software comes in two parts. The LOCKSS daemon is written in Java and requires only a standard Java virtual machine to run. We routinely run it on OpenBSD, Linux, and MacOS X. It can also run on Windows and FreeBSD. Given the huge volume of mission-critical software with similar requirements, the commitment of major vendors such as Sun Microsystems and IBM, and the availability of open-source JVM implementations, obsolescence of Java is not an immediate concern. Nevertheless, the LOCKSS team has a medium-term goal of persuading some other team(s) to write independent implementations of the LOCKSS protocol. Doing so would improve the reliability of the system even if Java were never to become obsolete.
The LOCKSS platform requires a generic PC and OpenBSD. The economics of technical markets make early obsolescence of the generic PC not an immediate concern. OpenBSD has an established history, but we routinely use other operating systems and if it became necessary could easily switch the entire system to use another operating system.
What kinds of initiatives are institutions interested in developing?
The library community is interested in collecting and preserving a wide variety of digital materials with a tool they can develop and manipulate to serve their local communities cost-effectively over the very long term.
During 2005 the following kinds of materials were collected and preserved on either the international or local LOCKSS system networks: special collections (including in-house digitized collections), websites, institutionally published materials, government documents (state, federal), and electronic theses and dissertations. Major projects that are using LOCKSS technology to preserve materials other than electronic journals are:
What does the future look like for the LOCKSS Program?
Very bright and extremely busy!
What will success look like for the LOCKSS Program, the LOCKSS Alliance, and the CLOCKSS Initiative?
However long you watch a digital preservation system, you can never be sure it will provide access when it is needed in the future. The LOCKSS Program, the LOCKSS Alliance, and the CLOCKSS Initiative are working to increase the odds that future readers will find today’s content when they need it.
Notes
1. LOCKSS Alliance Board: Carol Pitts Diedrichs, Dean of Libraries, William T. Young Endowed Chair, University of Kentucky Libraries; Nancy L. Eaton, Dean of University Libraries, The Pennsylvania State University; David S. Ferriero, Andrew W. Mellon Director and Chief Executive of the Research Libraries, New York Public Library; Brinley Franklin, Vice Provost for University Libraries, University of Connecticut; Michael A. Keller, Ida M. Green University Librarian, Director of Academic Information Resources, Publisher of HighWire Press, Publisher of Stanford University Press, Stanford University; Susan K. Nutter, Vice Provost and Director of Libraries, North Carolina State University; Ann Okerson, Associate University Librarian, Collections & International Programs, Yale University; Carton Rogers, Vice Provost and Director of Libraries, University of Pennsylvania
2. CLOCKSS Initiative Members:
- Publishers - American Medical Association, American Physiological Society, Blackwell, Nature Publishing Group, OUP, SAGE Publications, Springer, Taylor and Francis, John Wiley & Sons, Inc. In addition, Elsevier is participating in all discussions and is sharing in financial support.
- Libraries - Edinburgh University, Indiana University, New York Public Library, Rice University, Stanford University, University of Virginia
Feature Article
 NEDCC Survey and Colloquium Explore Digitization and Digital Preservation Policies and Practices
Author: Tom Clareson - PALINET (clareson@palinet.org)

The Northeast Document Conservation Center (NEDCC), with funding from a National Leadership Grant from the Institute of Museum and Library Services (IMLS), conducted an online survey in April/May 2005 on digital collection policies and practices. This survey was the first component of a project to develop a methodology for assessing the digital preservation readiness of cultural heritage institutions. Although the findings clearly illustrate the growing presence of digitization in libraries, archives, and museums, there is a distinct lack of policy to deal with the preservation of these items once they are created. Utilizing the results of the e-mail survey, NEDCC convened a colloquium of digitization and digital preservation experts in July 2005 to discuss the digital preservation needs of cultural heritage institutions and how to begin addressing those needs. The next step in this project, beginning in early 2006, will be to conduct on-site digital preservation readiness assessments at selected institutions to test a new digital preservation needs assessment methodology.
Survey Development and Methodology
The NEDCC Surveying Digital Collections project began in 2004, with cooperating partners including the American Institute for Conservation (AIC), the Center for Research Libraries (CRL), Heritage Preservation, and the Museum Computer Network (MCN). Its origins are an NEDCC grant proposal to IMLS for a Research and Impact Study to develop tools for assessing the preservation needs of museums’ digital collections. In response to “sweeping changes in museum operations as they begin to digitize collections and make them available” via the Internet, NEDCC and its partners planned to explore emerging technical issues “as a basis for drafting planning tools and guidelines that will help museums maintain and preserve digital assets.”
NEDCC’s proposal had three steps: developing and implementing a survey on the status of digital collections in cultural heritage institutions; convening a colloquium to focus on ways in which museums establish the value of their digital collections and prioritize their digital preservation activities; and developing practical guidelines and tools that could assist museum administrators and staff in planning for the management, maintenance, and preservation of their existing and future digital resources. NEDCC staff also saw the benefit of the survey process as an educational tool, providing cultural heritage institutions with a broader understanding of digital readiness and longevity issues, and helping museums, libraries, and archives plan more effectively for future development in their digital capabilities.
An Advisory Committee developed a prototype questionnaire to gather data on a wide variety of topics related to digitization and digital preservation. Sections of the survey focused on topics including:
- Preservation readiness, with questions regarding traditional and digital preservation activities
- Information technology infrastructure
- Creation and acquisition of digital collections
- Delivery methods for digital collections
- Administration and management practices for digital collections
- Rights and licensing issues for digital materials
NEDCC engaged TDC, a Boston firm that performs marketing research and analysis, to format the survey for the Web, administer the survey instrument, and tabulate survey data. On April 25, 2005, the survey was posted on the Web, and a link was e-mailed to a total of 1350 individual e-mail addresses divided among the library, archive, and museum communities. Additionally, the survey link was posted to four popular e-mail lists in the cultural heritage community:
- American Library Association – Preservation Administration Discussion Group (PADG)
- Conservation of Archive, Library, and Museum Materials (ConsDistList)
- Museum Listserv (Museum-L)
- Southeastern Museums Conference – Registrar’s Committee listserv
A reminder e-mail was sent to the individuals and posted on the e-mail lists on May 6, 2005. The survey process ended on May 16, 2005 with 174 total responses received, of which 169 were determined to be valid, for a response rate of 12.5%.
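The reported rate can be reproduced from the figures given, as in this short check. It takes the 1350 e-mailed addresses as the denominator; since the link was also posted to four e-mail lists, the true number of people reached is unknown and the rate is best read as approximate.

```python
emailed = 1350    # individual addresses that received the link
received = 174    # total responses
valid = 169       # responses determined to be valid

valid_rate = valid / emailed
raw_rate = received / emailed

print(f"valid responses: {valid_rate:.1%}")
print(f"all responses:   {raw_rate:.1%}")
```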
Survey Results: Institutional Data
The majority of the institutions that responded to the online survey instrument were college and university libraries (55 responses or 33.1%), archives (24 responses or 14.5%), art museums (15 responses or 9%), and public or municipal libraries (13 responses or 7.8%). Other categories of cultural heritage institutions that answered the survey included: anthropology/ethnology, natural history and science/technology museums; historical societies; and state libraries. In addition, there were 41 responses (or 24.7% of the total) categorized as “other” types of institutions; most of these were additional types of museums, such as private and historic estate museums, historic site/museums, and regional library/museum/institutions. Survey respondents were most commonly preservation officers, archivists, librarians, or administrators.
The survey respondents represented institutions with a wide variety of staff sizes and operating budgets. Many smaller institutions (especially archives) with 1-20 full-time equivalent (FTE) employees responded (64 or 38.8% of all responses). Those with 21-100 employees (42 or 25.5%) and 101-500 FTE (46 or 27.9%) were also well-represented, and this group included mainly college and university libraries. The very largest institutions polled (501 staff or greater) provided 13 responses or 7.9% of the total.
 Figure 1. Respondents by Full-time Equivalent (FTE) Employees
Annual operating budgets of responding institutions varied widely: 32 institutions (20%) reported budgets of $250,000 or below; 37 (23%) reported budgets of $250,000-$1,000,000; and the largest group (66, or 40%) reported budgets of $1,000,000-$20,000,000. In addition, 28 institutions (17%) reported budgets of $20,000,000 and above; most of these were college and university libraries.
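The budget percentages can be checked against the counts with a short calculation. The counts are those reported above; the band labels are shortened for the example. The counts total 163 rather than the 169 valid responses, which suggests that a few respondents did not answer this question.

```python
# Reported counts per budget band.
counts = {
    "$250,000 or below": 32,
    "$250,000-$1,000,000": 37,
    "$1,000,000-$20,000,000": 66,
    "$20,000,000 and above": 28,
}

total = sum(counts.values())
shares = {band: round(100 * n / total) for band, n in counts.items()}

print(total)
for band, share in shares.items():
    print(f"{band}: {share}%")
```

The largest band holds about 40% of these responses, so it is the biggest single group without being more than half of the respondents.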
 Figure 2. Respondents by Operating Budgets
A few questions dealt with overall preservation practices, including analog and digital materials. Most interesting were the results of questions dealing with written institutional policies, plans, or procedures. A strong majority of responding institutions had, or were developing, documents including Mission and Goals, Collection Development, and Emergency Preparedness policies. An indication of the shortfall in digital collections management is that a pronounced minority of written policies for activities such as collection development, emergency preparedness, preservation, and especially exhibitions, specifically address the institutions’ digital holdings. The lack of these types of policies was most noticeable in the art museum and archival communities. The traditional preservation activities most likely to be carried out by responding institutions were environmental monitoring, item-level condition assessment of objects, and emergency preparedness activities. By far the weakest area of policy development among respondents was in the crafting of written plans and procedures focused on the creation of digital resources. Only 29% of respondents had such a policy. Archives, as a group, were especially weak in development of this type of policy, while institutions with larger budgets were leaders in this type of policy development. A positive sign was that policy development for digital resource creation was the area in which the largest group of respondents (41%) said a policy was being developed.
Similar to findings in recent statewide preservation needs assessment surveys, 102 (63%) respondents said that 5% of their budget or less was devoted to any type of preservation activities. Nine percent of the response pool (14 institutions) had no funds whatsoever allocated for preservation activities.
Information Technology Issues and Infrastructure
Information Technology (IT) staff are key in the success of cultural heritage digitization projects. Almost all of the respondents have information technology services provided at their institution including network support, desktop support, security and protocols, back-up and disaster recovery, centralized hardware and software acquisition and maintenance, and file management and storage. When asked if institutions supported information technology applications for digital collection management, a clear majority responded that they were active in digital imaging, Web design and development, and building collection information management databases. Over half of the institutions polled provided the public with the capability to search the institution’s collection database online.
The management of information technology services and policymaking at these institutions provided some surprises. While 115 (68.9%) respondents had an IT Department, 52 (31.1%) did not. In those facilities with IT as a separate department, staffing levels were relatively low, with 139 (82.2%) having 0-4 FTE assigned responsibility for some information technology activities and 29 (17.2%) with 4-8 staff members. In 71 institutions (43.8%) there was no specific person assigned primary responsibility for information policy. This was especially true in the archival community. Information policy was a full-time responsibility of a staff member at 52 institutions (32.1%) and part-time at 39 facilities (24.1%). In the majority of cases where there was a staff person with information policymaking responsibility, this responsibility fell to the Library Director/Librarian (18 or 20% of respondents); Director, Vice President, or Manager of Information Technology (11 or 12.2%); or Information Systems Manager (also 12.2% of respondents). A larger number of responding institutions had a staff member with full- or part-time primary responsibility for systems administration and/or database administration.
A key finding in this area is that, while 36 respondents (24%) allot over 10% of their annual budget to information technology, there were a number of cases (25 responses or 17% of the total) where no or very low funding levels exist for IT support. Respondents from historical societies and archives were among those groups with a high percentage indicating no funds expended in this area. A positive trend was that 44 percent of respondents (72 cases) noted that their institution’s IT budget has increased over the past five years, while 28.4% said this budget area has remained stable. It is likely that some of this funding went to systematic updating of software, which 142 respondents (86.6%) said they did, and/or for systematic upgrading of hardware (132 or 80%). An interesting finding in this area is that mid-sized institutions (those with budgets between $10,000,000-20,000,000) were less apt to upgrade their hardware and software routinely.
Digital Collection Creation and Acquisition
A vast majority of the respondents (151 respondents or 92%) had created digital assets from physical source materials. The following results are from the portion of the survey instrument that looked at the types of digital collections currently being created and/or acquired.
The variety of formats being digitized was wide-ranging. Eighty-two percent of respondents (142) had digitized flat works on paper/photographic prints. Books and other multi-page items, analog audio and video, three-dimensional objects and motion picture film were other top-rated answers to a question that asked respondents to select all of the categories that applied to their collections.
Survey question: If so, from which of the following types of source materials (Select ALL that apply)
| Source material | Responses | Percent |
| Flat works on paper/photographic prints | 142 | 84% |
| Books and other multi-page items | 83 | 49% |
| Analog audio and video | 67 | 40% |
| Three-dimensional objects | 66 | 39% |
| Film | 53 | 31% |
| Microfilm | 39 | 23% |
| Other | 12 | 7% |
Table 1: Types of Physical Source Materials Being Digitized.
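The percentages in Table 1 appear to be each count divided by the number of valid responses. The following is a minimal sketch of that arithmetic, assuming the denominator is the 169 valid responses reported for the survey; the variable and function names are illustrative and do not come from the survey instrument.

```python
# Counts copied from Table 1. The denominator (169 valid responses) is an
# assumption; the survey report does not state which base was used.
VALID_RESPONSES = 169

TABLE_1_COUNTS = {
    "Flat works on paper/photographic prints": 142,
    "Books and other multi-page items": 83,
    "Analog audio and video": 67,
    "Three-dimensional objects": 66,
    "Film": 53,
    "Microfilm": 39,
    "Other": 12,
}


def percent_of_respondents(count, total=VALID_RESPONSES):
    """Return the share of respondents as a whole-number percentage."""
    return round(100 * count / total)


# Respondents could select several categories, so the shares need not sum to 100.
table_1_percentages = {
    material: percent_of_respondents(count)
    for material, count in TABLE_1_COUNTS.items()
}

for material, pct in table_1_percentages.items():
    print(f"{material}: {pct}%")
```

Under that assumption, the computed values match the percentages printed in Table 1.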
In addition, many respondents stated that their institutions were already engaged in collecting, acquiring, or creating digital assets, most of which were in the categories of simple text; photography and other still image materials; and music, spoken word, or other audio materials.
One of the key findings of this survey was that 59 respondents (39%) said that the majority of the items they consider to belong to digital collections are unique, single-copy works. As one might expect, archival respondents reported a majority of these unique materials. This finding supports the good practice of acting to preserve both the digital and original copies of works that are thought to be one-of-a-kind.
Survey question: Of all the items you consider to belong to digital collections, are the majority (Select ONE):
| Type of content | Responses | Percent |
| Unique (single copy works) | 59 | 39% |
| Replicated in both digital and analog versions | 43 | 29% |
| Replicated in other analog versions | 28 | 18% |
| Replicated in multiple digital versions | 21 | 14% |
Table 2: Types of Content in Digital Collection
Most responding institutions knew what type of file formats (TIFF, JPEG, etc.) existed in their collections and said they held three or fewer formats of original items, including text, encoded text, still and moving images, sound, and geo-spatial materials.
Digital Collection Delivery Methods
A somewhat surprising finding was that internal staff (147 respondents or 87%) and on-site visitors (99 or 58.6%) were the leading users of digital collections. At 92 (54.4%) of the institutions, there was general availability of digital collections via the Internet or another network to users worldwide. Additionally, 61 (36.1%) of the institutions polled offered controlled availability, via the Internet or another network, to selected external users. The control of access to collections was particularly evident among archival respondents. The leading method of providing access to digital collections was through a website associated with the institution (113 or 66.9%); removable media and stand-alone computers were each utilized by about one third of the responding institutions. Evidence of collaborative digital activities was also discovered in the answers to this question: nearly one-quarter of the responding organizations (41) made available at least a portion of their digital collections through a website associated with another organization.
Another positive finding of the survey was that 83% of the respondents had created descriptive metadata for their digital assets in order to facilitate discovery and use of their digital collections. Descriptive metadata was the most common form of metadata created. At least half of the respondents also created technical and administrative metadata. When presented with nine different activities and asked to “rate the importance of the following goals of your digital program on a scale of 1 (least important) to 5 (most important),” the “identification” of digital items, in which metadata plays a key role, was most often rated (98 respondents) as the most important goal. Seventy respondents rated “study and use by local users” as most important, which far outweighed “study and use by remote users” and “reducing handling” (identified as most important by 44 respondents each). And, although it is always a discussion topic in the digitization community, “generating revenue” was rated the least important goal by the highest number of respondents.
Finally, the survey results indicated that responding institutions used a mix of methods to manage their metadata, including integrated databases, stand-alone databases, file-directory conventions, and file structures. No single methodology was a clear leader in this area.
Administration and Management of Digital Collections
A high number of institutions reported having no or low levels of institutional funds allocated for creation, acquisition, management, or sustainability of digital collections. While 25% of the institutions responding do not assign any portion of their budget to create digital collections, an even greater number (almost 42%) do not have budget lines for acquiring digital collections. The lack of budget for acquisition and maintenance of digital materials was most clearly evident among the archive, public library, and ethnological/anthropological museum respondents. Most of the institutions that are directing some funding into digital activities allocate only between 1-5% of their institutional budget for these activities. For example, a majority of college and university libraries allot less than 1% of their budgets to digital content creation.
| Survey question | 0% | 0-1% | 1%-5% | 6%-10% | >10% |
| If your institution creates digital collections, approximately what percentage of your institution's annual budget is allotted to this activity? | 39 (25%) | 58 (37%) | 41 (27%) | 12 (8%) | 5 (3%) |
| If your institution acquires digital collections, approximately what percentage of your institution's annual budget is allotted to this activity? | 62 (42%) | 33 (22%) | 33 (22%) | 11 (8%) | 9 (6%) |
| Approximately what percentage of your institution's annual budget is allotted to manage and sustain digital collections you have created or acquired? | 46 (30.3%) | 46 (30.3%) | 47 (30.9%) | 8 (5.3%) | 5 (3.3%) |
Table 3: Percent of Funds Allocated to Digital Collections Administration and Management
According to respondents, current or future funding specifically targeted toward digital preservation will be derived from institutions’ regular budgets (43 respondents or 25%), from grant funding (37 respondents or 22%), or from a hybrid of grants and regular budget funds.
Sixty percent of respondents indicated that there was not a specific person assigned responsibility/primary activity for digital preservation. In the cases where there was a staff person with responsibility for maintaining digital collections, it was more often preservation staff (64.6%) than IT staff (47.4%) with this set of tasks as part of their activities.
A bright spot in the survey results was that 136 institutions, or almost 84%, supported staff development and professional education/training in the area of digital preservation. However, it seems as though this commitment to education has not yet translated into policy development.
With regard to digital preservation, participants were asked to select all strategies their institution had discussed implementing. Answers included regular data backup (utilized by a vast majority of the respondents), migration, and refreshing the data. Other strategies used are the maintenance of legacy equipment (hardware to read obsolete or less-often-held formats such as floppy disks), outsourcing digital preservation to an externally-managed repository, and emulation. The types of media used for storing digital collections are online magnetic media, such as network hard drives (78%), or removable magnetic media (65%). Digital collections are most often stored in-house in systems managed by the institution or by a combination of in-house, partner organization, and storage vendor (outsourced) methods.
In two other strategic areas, there is cause for concern. Eighteen respondents (11%) are not utilizing a backup strategy at all, and 30 (19%) back materials up once, which means that 30% of collections are not adequately protected by a backup strategy. Also, when asked if their digital assets are insured, 84 respondents (52.8%) said no, 58 (36.5%) did not know, and only 17 institutions, or just over 10%, said they insured these holdings. Clearly, these are areas for further digital policy and practice development.
The majority of respondents noted that the maximum lifespan required to both maintain digital materials and to retain their deliverability and usability so the materials can serve their intended purposes was 25 years or more, which was the longest option offered by the questionnaire.
Finally, the survey touched on some specific rights and licensing issues. A majority of respondents consider copyright and intellectual property concerns in their digitization activities. Most also attempt to acquire digital rights to materials when they are acquiring or collecting digital materials.
Conclusions from the Survey
Clearly, digitization is fast becoming a routine activity in many museums, libraries, and archives. More than 90% of the responding institutions are creating digital assets from physical source materials; over 88% are either collecting, acquiring, or creating digital assets. In addition, institutions are considering digital preservation issues – supporting staff development and education on the topic and establishing 25 years or more as the targeted maximum lifespan for deliverability, use, and maintenance of digital materials. A majority are creating metadata for their digital collections and are taking copyright issues into consideration in their creation workflow.
However, the creation of policies addressing the preservation and management of digital assets lags far behind other areas of policy development in cultural heritage collections. Only 29% of the institutions responding to the survey have policy, planning, or procedure documents on the creation of digital resources, although 41% say policies are in development. A gap between digitization and digital preservation practice is further suggested by the fact that, except for inclusion in rights and licensing policies, digital holdings are not included in the majority of policy statements for many areas of institutional operation, from mission and goals to emergency preparedness, to exhibit policies.
Moving Forward
Following this survey, NEDCC convened a colloquium of fourteen digital preservation experts in Boston on July 11-12, 2005 to address the lack of policymaking for digital preservation. The group reviewed the survey results and discussed digital preservation needs and proposed solutions. The group considered preliminary data from the Heritage Preservation “Heritage Health Index,” which looks at the preservation needs of all formats of cultural heritage collections. It reviewed a variety of existing assessment models, especially the IMLS-funded Conservation Assessment Program (administered by Heritage Preservation), which enables museums to work with outside conservation consultants to review their preservation needs and write a report identifying priorities.
The key conclusion from the colloquium was that small and medium-sized institutions will need the assistance of experts to assess the preservation status and needs of their expanding digital collections. The group suggested development of expert-facilitated on-site assessments, supported by an online assessment tool on digital preservation. The participants discussed methods to expand this type of technical assistance into a national program.
As a result of the colloquium discussions, a subgroup of the colloquium attendees, along with experts from the collaborative digitization arena, is refining the existing questionnaire on digital readiness for use as a diagnostic instrument and testing it through a limited number of site visits conducted by expert teams.
The first of these site visits took place in January 2006. In preparation, the libraries, museums, and historical society selected for on-site visits completed the revised questionnaire on digital readiness. Then, two teams of two surveyors spent one day at each site discussing digital program development and digital preservation issues. The surveyors have reported that on-site discussions with key digital resource personnel enabled them to develop a better picture of digitization practices and digital preservation status than the questionnaire alone, allowing them to advise on key issues and questions. The surveyors have remarked that the on-site visits are producing “not just a survey of digital preservation at these institutions, but an overview of all digital policies and practices.”
A total of 6-8 surveys will be conducted by the teams during this pilot/testing phase of the on-site survey portion of the project. Further information about the effort will be released to the cultural heritage community as it becomes available.
NEDCC hopes to use the results of the online survey, colloquium, and on-site survey process to develop a methodology for digital surveying and tools that can be utilized in training additional consultants to perform the on-site digital program surveys.
Highlighted Web Site
ConsortiumInfo.org
The old adage about not judging a book by its cover translates nicely to websites. Some sites have great visual appeal and their home pages dazzle you with graphics, but the substance beneath is disappointing. Others are far more spare in appearance, but rich in content. This issue’s Highlighted Web Site, ConsortiumInfo.org, definitely falls into the latter category.
Hidden under a very modest-looking home page is an impressive array of timely and useful information and services devoted to “standards, standard setting, and open source software, and on the role that these essential tools play in business and society.” All the more notable then, to learn that the site is owned and written almost entirely by one man, Boston-based attorney Andrew Updegrove. Boston just happens to be at the nexus of an extremely important battle in the war over open file format standards right now (see this issue's FAQ for more information) and Mr. Updegrove’s regular blog entries have provided essential details, documents, and commentary on the subject. Updegrove’s contributions to the standards community were recognized in 2005 by ANSI (the American National Standards Institute), which presented him with its President's Award for Journalism, the first time anyone other than a full-time journalist has received it.
The site contains numerous other features of potential interest to the readers of RLG DigiNews, and has greatly expanded its content in the past year, at a time when standards are playing an increasingly crucial role in digital imaging and digital preservation. Included are an exhaustive list of consortia and standards-setting bodies, guidelines for evaluating consortia and deciding on whether to get involved in standards-setting, a nearly monthly bulletin, Consortium Standards Bulletin (an “eJournal of news, ideas and analysis on the topics of consortia and standard setting”), a MetaLibrary containing over 1000 indexed and abstracted articles on standards and standard setting, and a standards and technology bookstore.
There’s more, but the best way to get to know the site is probably to start with the “About This Site” page and peruse the “Fourteen Ways to Get the Most Out of ConsortiumInfo.org.” Though the site is sponsored by a law firm, it is run like a non-profit, and all information on the site is free and open, including subscriptions to Consortium Standards Bulletin and several RSS feeds.
FAQ
OpenDoc Prescription a Bitter Pill for Microsoft in Massachusetts
Author: Richard Entlich - Cornell University (rge1@cornell.edu)
I’ve been seeing a lot of headlines involving Microsoft and the Commonwealth of Massachusetts lately. What’s the story behind the headlines?
[Note: For further details and to see the exact order in which this story unfolded, check out the Chronology of Events]
Introduction
The Commonwealth of Massachusetts and its capital city Boston have a rich and colorful history of oppositional and downright revolutionary behavior. From the Boston Tea Party and “the shot heard round the world” at the battles of Lexington and Concord in the 18th century, to the stringent censorship of “salacious, immoral or offensive” culture in the 19th and 20th centuries that led to the immortalization of the phrase “Banned in Boston,” Massachusetts residents and government officials rarely shied away from controversy.
In the 21st century, a new fracas with its epicenter at Massachusetts’ seat of government has had all the elements of a modern media spectacle: drawn-out litigation with David vs. Goliath overtones, accusations of insensitivity to the needs of a minority group, a major newspaper claiming possible conflict of interest in the acceptance of travel funds by a high government official [1] who later resigned, contentious Senate hearings, an 11th hour amendment inserted by the legislature as part of a power struggle with the executive branch, and political positioning with potential impacts on gubernatorial and even presidential politics. What’s at the core of this new tempest? Health care? Taxes? Gay marriage? Would you believe ... file formats?
On September 21, 2005, the Commonwealth of Massachusetts’ Information Technology Division (ITD) released a revision of its Enterprise Technical Reference Model: Information Domain (ETRM) specifying that its executive branch agencies would be required to migrate office document software to applications able “to save office documents by default in the OpenDocument format” by January 1, 2007 and that “any acquisition of new office applications must support the OpenDocument format natively.” The only other acceptable format mentioned in the document was PDF v.1.5 or later. The government justified the new policy based on cost savings (for itself and taxpayers), improved interoperability, better document management, and enhanced long-term access to public records for government, the public, and historians.
This single action unleashed a pitched battle that traces its origins back to court fights between the Commonwealth of Massachusetts (hereafter CoMA) and Microsoft, and which has boiled over into a closely watched affair involving Microsoft, several other major software vendors, and various interests both inside and outside of CoMA government. In the meantime, what might otherwise have been an obscure internal IT document from a New England state became a cause célèbre of the open source movement and threatened the business model of the world’s largest software company.
Background
CoMA has established itself in recent years as one of the feistiest challengers to Microsoft’s software supremacy and closed, proprietary products. It was one of several states that filed class-action lawsuits against Microsoft, claiming violations of its consumer protection and unfair competition laws. CoMA was also one of 19 states (plus D.C.) party to the 1998 US Justice Department antitrust case against Microsoft. When the Justice Department settled its case against the company in November 2001, CoMA was one of nine states to appeal to the courts for harsher penalties. When a US District Court judge rejected those penalties in November 2002, CoMA was the only state to remain in the battle and file a further appeal, which a US Appeals Court ultimately rejected in June 2004.
Part of CoMA’s tenacious court battle with Microsoft centered on getting the company to improve the interoperability of its products and release more of its proprietary code as open source. Even before the Appeals Court ended CoMA’s efforts to reform Microsoft through litigation, the commonwealth was quietly beginning a campaign to gain control over its own software infrastructure using a new tactic. After five years of butting heads and millions of dollars in legal fees with little to show for its efforts, the CoMA legislature passed legislation, in September 2003, favoring open-source software and adherence to open standards in government computing systems. Rather than target a single company in the courts, CoMA decided to try to get all its software vendors to support its vision through revised procurement guidelines. By January 2004, CoMA’s Information Technology Division had released the first version of its Enterprise Open Standards Policy.
In January 2005, speaking before the Massachusetts Software Council annual meeting, Eric Kriss, CoMA Secretary of the Executive Office for Administration and Finance, spelled out concerns in terms that would warm the cockles of any librarian’s or archivist’s heart. Here’s a brief excerpt, though the entire speech is worth reading.
It should be reasonably obvious for a lay person who looks at the concept of Public Documents that we’ve got to keep them independent and free forever, because it is an overriding imperative of the American democratic system that we cannot have our public documents locked up in some kind of proprietary format or locked up in a format that you need to get a proprietary system to use some time in the future.
So, one of the things that we’re incredibly focused on is insuring ... that public records remain independent of underlying systems and applications, insuring their accessibility over very long periods of time. In the IT business a long period of time is about 18 months. In government it’s about 300 years, so we have a slightly different perspective.
Open Formats insure also that there are minimal restrictions imposed on the use of applications needed to access those records and files. And finally, Open Formats support the integrity of public records when we’re going to need to do a file conversion as required probably in the course of normal technological evolution. So if we have something in a format of 2005, and it’s going to need to be converted in 2038 into something that we’ve never thought of yet, we need to be able to do that without losing the integrity in the underlying information.
Though seemingly throwing down the gauntlet to Microsoft, CoMA had, in fact, been consulting with many parties trying to come up with a meaningful understanding of what makes a software format truly open. In the process, it identified concerns over the terms of Microsoft’s patent license for the Office 2003 XML schemas, which had been released originally in December 2003. Even though Microsoft made use of its XML schemas royalty-free, its aggressive patent seeking in various uses of XML related to software development had raised concerns about its future intentions. In discussions with CoMA, the software giant agreed to slightly modify the license, releasing a new version about two weeks after Mr. Kriss’s speech. The new version made the license perpetual and added a clarifying paragraph stating that
...given the unique role of government institutions, end users will not violate this license by merely reading government documents that constitute files that comply with the Microsoft specifications for the Office Schemas, or by using (solely for the purpose of reading such files) any software that enables them to do so. The term “government documents” includes public records.
In the short term, the changes seemed to satisfy CoMA officials. ITD’s March 2005 draft ETRM:Information Domain identified Microsoft’s Office XML as meeting the criteria for an open format, along with RTF (Rich Text Format), plain text, HTML, PDF, and OpenDocument Format (ODF), which was pending approval by standards body OASIS (Organization for the Advancement of Structured Information Standards). ODF is an XML-based format used natively by open source office suites such as OpenOffice.org and StarOffice.
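To make the "XML-based" description concrete: an ODF text document is a ZIP package whose main text lives in a member named content.xml, so general-purpose tools can read it without an office suite. The sketch below is illustrative only. It builds a tiny ODF-style package in memory (not a complete, valid .odt file) and reads the paragraphs back using Python's standard library. The helper names make_sample_package and read_paragraphs are hypothetical and belong to no ODF toolkit.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces defined by the OpenDocument specification.
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def make_sample_package(paragraphs):
    """Build a minimal ODF-style ZIP package in memory (hypothetical helper).

    This is a stripped-down stand-in for a real .odt file: it has only the
    mimetype entry and a content.xml with plain paragraphs.
    """
    root = ET.Element(f"{{{OFFICE_NS}}}document-content")
    body = ET.SubElement(root, f"{{{OFFICE_NS}}}body")
    text = ET.SubElement(body, f"{{{OFFICE_NS}}}text")
    for paragraph in paragraphs:
        ET.SubElement(text, f"{{{TEXT_NS}}}p").text = paragraph

    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("mimetype", "application/vnd.oasis.opendocument.text")
        package.writestr("content.xml", ET.tostring(root, encoding="utf-8"))
    return buffer.getvalue()


def read_paragraphs(package_bytes):
    """Extract paragraph text from content.xml using only generic ZIP/XML tools."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        root = ET.fromstring(package.read("content.xml"))
    return ["".join(p.itertext()) for p in root.iter(f"{{{TEXT_NS}}}p")]


sample = make_sample_package(["A public record.", "Still readable later."])
recovered = read_paragraphs(sample)
print(recovered)
```

The point of the sketch is the design property the policy debate turns on: the text can be recovered with generic ZIP and XML libraries rather than a specific vendor's application.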
However, by the time ITD released the final policy draft for comment in late August 2005, only OpenDocument (which had subsequently been approved by OASIS) and PDF had made the cut. Exactly what happened between March and August to cause CoMA’s ITD to change its mind about including Microsoft’s Office XML isn’t completely clear (it’s the subject of a lengthy article by ZDNet’s David Berlind).
The call for comments elicited over 150 responses, running roughly 2:1 in favor of the proposed new policy. Support was widespread, encompassing residents of many US states and foreign countries as well as several major software companies, including Sun Microsystems and IBM (both major supporters of ODF), Adobe, and Corel.
Not all the comments were favorable. A small “astroturf” campaign of identical letters describing ITD’s plan as “unnecessary, wasteful, costly, taxpayer unfriendly, and harmful to the IT industry” (without explaining why) emerged. More troubling were two groups of critical responses: one from CoMA officials, expressing concerns about cost, complexity, timing, and implementation details, and one from people with disabilities and related advocacy groups, pointing out limitations in the ability of ODF and compatible software to serve the visually impaired.
As might be expected, the bitterest opposition came from Microsoft. In a 15-page retort, Microsoft’s representative chastised the ITD and its parent department for what it claimed were violations of CoMA’s public comment and procurement regulations and characterized its decision to delete Microsoft’s Office XML schemas from the list of open formats as inconsistent, discriminatory, unprincipled, unreasonable and, by implication, terribly naive. It demanded that CoMA restore Office XML to the list of open formats. The one thing Microsoft didn’t do was offer any further concessions.
ITD responded with an FAQ posted on its website, providing point-by-point rebuttals to the objections raised by Microsoft and the other critics. It refuted concerns that the move to ODF might actually increase, rather than decrease, costs, and that it lacked the legal authority to invoke the new policy. In replying to claims that Microsoft’s Office XML specifications were “just as open” as PDF, ITD asserted that “[Microsoft’s] license is not as open as Adobe’s copyright license for PDF. Adobe’s copyright license for the data structures, operators and written specifications constituting the interchange format called the Portable Document Format or ‘PDF’ imposes minimal legal restrictions on developers.”
Of significance was the attempt to clarify the difference between specifying particular open source software and specifying a particular open source format. Many critics of the proposed new policy assumed it implied that Microsoft Office would no longer be permitted, or even, in the case of many of the complaints from the disabled, that it implied Microsoft Windows would have to be replaced by the open source operating system Linux. Linux currently lags behind Windows in its accommodation of users with many kinds of disabilities.
In order to avoid running afoul of competitive procurement and bidding requirements, ITD could not, in fact, specify the use of one particular software product. The policy states that any software used must be able to support one of the acceptable file formats natively. Clearly implied is that if Microsoft added ODF to its slate of supported file formats, its Office suite could join a long list (including OpenOffice, StarOffice, KOffice, Abiword, eZ publish, IBM Workplace, Knomos case management, Scribus DTP, TextMaker, and Visioo Writer) of acceptable products that already supported ODF.
Microsoft wasn’t biting, however, and when the final revised ETRM v.3.5 was released on September 21, 2005, CoMA held its ground and continued to exclude Microsoft’s Office XML schemas from its list of acceptable open formats. Undoubtedly, CoMA must have anticipated that some fireworks might follow. After all, it was saying, in effect, that unless Microsoft revised its Office products to natively support ODF by January 1, 2007, new purchases of Microsoft’s products (as well as those of any of its competitors with similar limitations) would be off-limits to CoMA’s 80,000 executive agency employees. Considering the potential direct loss of sales, plus subsequent losses from businesses and contractors that work with CoMA, and the possibility that other states or municipalities might follow suit, the stakes for Microsoft were high.
Within a month’s time, politics as usual intervened and threatened to abort the new policy. Although some of the machinations became quite personal and ugly, something quite wonderful also happened. Over the two months immediately following the final release of ETRM 3.5, the two major opposing camps (ODF supporters and Microsoft) were virtually falling over each other, each trying to prove that it was the purveyor of the most open format. Microsoft criticized Sun’s intellectual property claims to ODF, and Sun quickly made an irrevocable promise not to assert its patent rights. OASIS submitted ODF to ISO (International Organization for Standardization) for approval as an international standard. Then, at a summit of ODF supporters, a pledge was made to rapidly fix the accessibility problems in the format.
For its part, Microsoft announced that it would support PDF as an output format in the next release of Office and that it would submit Office XML to Ecma International, a standards body headquartered in Geneva, Switzerland (and ultimately to ISO), for approval as a standard. It then published its own irrevocable covenant not to seek to enforce its patent claims on the Office 2003 XML schemas. A few weeks later Microsoft expanded the covenant to include forthcoming releases of Office XML and the standard being submitted to Ecma. It seems unlikely that most of this would have happened without CoMA’s policy maneuvers.
Some of the other political fallout was not so pleasant. William Galvin, Secretary of the Commonwealth, a Democrat, and possible gubernatorial candidate who oversees the public records office, expressed “grave concerns” about the plan. Massachusetts Senator Marc Pacheco, chair of the Senate Post Audit Committee, also came out against the plan and announced hearings for the end of October, during which the ITD was accused of failing to consult with the legislature and other executive agencies. He also challenged ITD’s authority to implement the new policy. A few days later, a last-minute amendment was added to an unrelated economic stimulus bill that, if approved, would strip ITD of its power to set policy regarding IT standards and hand that power over to a politically appointed task force. A “Thanksgiving massacre” article in the Boston Globe accused one of the chief architects and prime advocates of the policy, ITD chief Peter Quinn, of financial improprieties regarding use of state travel funds. Although exonerated of any wrongdoing in just a few weeks, Quinn announced his resignation shortly thereafter, citing the untenability of his position and the attacks on his integrity.
The Role of Libraries
Though these events could have a far-reaching impact on the digital preservation activities of libraries and archives, those institutions have so far played a minor and largely peripheral role in the ongoing saga. This story has been extremely well-reported in the computing press and on websites emphasizing legal aspects of open source software (see resources, below), but it has not been closely followed in the library press.
As reported by Andy Updegrove in the Standards Blog at ConsortiumInfo.org (a superb source of information on this story, and this issue’s Highlighted Web Site), Microsoft’s press release on its decision to submit Office XML to Ecma quoted Adam Farquhar, head of e-Architecture for the British Library, describing the company’s decision as “an important step forward for digital preservation [that] will help us fulfill the British Library’s core responsibility of making our digital collections accessible for generations to come.” It should be noted that just a few weeks earlier Microsoft had announced a commitment of $5 million to digitize 150,000 books, two thirds of which would come from the collection of the British Library.
Less than a week after the press release emerged, Farquhar was quoted in an interview doing what could be regarded as backpedaling on his previous Office XML endorsement. “Some people think we are adopting Microsoft formats as our standard for digital preservation. This is not right; we are striving to make sure that content we receive in MS formats will be preserved.” He continued: “What format will we deliver? We deliver a lot of articles and in many formats. We deliver content in PDF, Office Open, ODF, TIFF—whatever format the customer wants.”
On December 12, 2005, five US national library associations (the American Association of Law Libraries, the American Library Association, the Association of Research Libraries, the Medical Library Association, and the Special Libraries Association) belatedly released a statement applauding CoMA’s commitment to ODF and urging other states to follow its lead.
During the CoMA hearings on the ETRM v.3.5, counsel to ITD was asked to prepare a brief justifying the legal basis for the new policy. Early on in the lengthy brief produced by Linda Hamel, General Counsel for CoMA’s ITD, an attempt was made to bolster the cachet of ODF by providing evidence of its already solid footing in the library and archives community: “The standard ... has been adopted by the Library of Congress for electronic record preservation, and is being used by the National Archives and Records Administration in their Electronic Records Archive project.” (Note: we have been unable to verify the claims of current use of ODF by either LC or NARA.)
Where is This All Leading?
The day after Microsoft announced that it would submit the Office XML schemas to Ecma and ultimately ISO for consideration as an international standard, CoMA’s Administration and Finance Secretary released the following statement: “The Commonwealth is very pleased with Microsoft’s progress in creating an open document format. If Microsoft follows through as planned, we are optimistic that Office Open XML will meet our new standards for acceptable open formats.” Some concluded this was an indication that the CoMA administration’s support for ODF was softening. However, the January 30, 2006 press release announcing Peter Quinn’s successor as ITD chief suggested support was firming again: “[Louis] Gutierrez will be responsible for overseeing the final stages of implementation of the state’s new Open Document format proposal, to go into effect in January 2007.” To some degree, it may be an issue of who blinks first. Microsoft has not completely ruled out the possibility of supporting ODF within its Office suite, and CoMA would rather that Microsoft concede on this point, but Microsoft hopes the steps it’s taking will be deemed sufficient.
At this point, it is still unclear exactly how CoMA will proceed. As of this writing, the amendment stripping ITD of its policymaking powers is still in play. Oddsmakers think the amendment will be removed, but the politics are tricky. If it survives the final vote on the economic stimulus bill, it could spell the end of the new policy. Only time will tell.
Ultimately, digital preservation is a process of risk management. Much of the digital preservation activity being carried out today could be classified as reactive: that is, building procedures, tools, and systems designed to cope with the risks in the current landscape. A true transformation of the landscape that reshapes the entire risk profile requires a more proactive approach. Unfortunately, actions that challenge the status quo and entrenched interests carry high risks themselves and often face serious “chicken and egg” obstacles. Indeed, one of the CoMA plan’s main architects lost his job, his former department may lose much of its decision-making authority, and some of the heaviest criticism of the Massachusetts ODF initiative has focused on the fact that “nobody else is requiring its use.” But if no one in a position of power is prepared to step up and take a chance, we will never discover what degree of change is possible.
Even if CoMA takes no further action in this arena, its actions have already had a salutary effect on the hegemony of proprietary file formats. Software and data file format standards are driven by the activities of large volume purchasers. CoMA has shown the way for users to empower themselves. Even its mistakes, such as not involving all stakeholders early on, have been instructive. The lesson of influence through internal policy making as opposed to failure through litigation will not be lost on others.
Chronology of Events
|
May 18, 1998 |
CoMA (Commonwealth of Massachusetts) joins 18 other states, D.C., and the Federal Government in a massive antitrust suit against Microsoft. |
|
Nov. 6, 2001 |
The US Justice Department and nine states settle with Microsoft. |
|
2002 |
Nine states (including CoMA) and D.C. ask the court for stiffer penalties against Microsoft. |
|
Nov. 1, 2002 |
The District Court rejects most of the remedies requested by the states that rejected the original settlement. |
|
July 16, 2003 |
CoMA alone among the states appeals the District Court’s decision. |
|
Sept. 29, 2003 |
The CoMA legislature adopts a new policy favoring open-source software and adherence to open standards in government computing systems. |
|
Dec. 3, 2003 |
Microsoft publishes a patent license for its Office 2003 XML Reference Schemas granting royalty-free use of its schemas “solely for the purpose of reading and writing files that comply with the Microsoft specifications for the Office Schemas.” |
|
Jan. 13, 2004 |
CoMA’s Information Technology Division releases v.1.0 of its Enterprise Open Standards Policy. |
|
June 30, 2004 |
The Appeals Court rejects CoMA’s request to overturn the District Court's settlement with Microsoft by exacting harsher penalties and forcing Microsoft to improve interoperability and release more of its code as open source. |
|
June 30, 2004 |
The Government Open Code Collaborative is launched with the CoMA Information Technology Division as a charter member. |
|
Jan. 15, 2005 |
Eric Kriss, CoMA Secretary of Administration and Finance, speaks before the Massachusetts Software Council annual meeting and lays out a strong case for support of open formats based on the need for public documents in electronic form to remain accessible for long periods of time. |
|
Jan. 27, 2005 |
Microsoft modifies and clarifies the patent license for its Office XML schemas, apparently as a result of discussions with CoMA officials concerned about some of its terms. |
|
March 22, 2005 |
CoMA posts version 3.0 of the Information Technology Division’s Enterprise Technical Reference Model, which identifies Microsoft’s Office XML as meeting its criteria for an open format (along with RTF, plain text, html, PDF, and ODF—pending approval by OASIS). |
|
May 23, 2005 |
OASIS members approve OpenDocument format v.1.0. The Microsoft representative to OASIS doesn’t cast a ballot. |
|
June 1, 2005 |
Microsoft announces that it is adopting XML technology for the default file formats in the next version of Office, code-named Office 12. “The new file formats, called Microsoft Office Open XML Formats, will become the defaults for the ‘Office 12’ versions of Microsoft Office Word, Excel, and PowerPoint.” |
|
June 9, 2005 |
CoMA holds an “Open Formats Summit” which includes representatives from OASIS, Sun, IBM, Microsoft, and HP at which it asserts its legal authority to adopt electronic format standards for Executive Department agencies and tries to formalize what is meant by the term “open” relative to standards and formats. |
|
Aug. 29, 2005 |
Massachusetts releases a draft revision of its Enterprise Technical Reference Model for public comment. The revision whittles down the list of data formats that meet CoMA’s definition of “open” to two: ODF and PDF, effectively withdrawing approval of Microsoft’s Office XML schemas. Over 150 comments from around the world are received. About two thirds are positive, but among the negatives is a significant group from visually impaired users and advocacy groups for the disabled. |
|
Sept. 8, 2005 |
Alan Yates, business strategy general manager for Microsoft’s Information Worker Product Management Group, submits a 15-page comment on the new draft ETRM policy, detailing concerns and asking CoMA to reinstate Office XML as a qualified standard under its criteria for openness. |
|
Sept. 21, 2005 |
CoMA officially upgrades its Enterprise Technical Reference Model from version 3.0 to 3.5. The only change is to the Information Domain, specifying a requirement that all executive branch agencies begin utilizing either ODF 1.0 or PDF 1.5 by Jan. 1, 2007. Despite pleas from Microsoft, Office XML does not make the cut. |
|
Sept. 22, 2005 |
Brian Jones, Microsoft’s Program Manager for its Office suite, writes in his blog questioning just how open the ODF standard is, pointing to apparent intellectual property claims to it by Sun Microsystems. |
|
Sept. 29, 2005 |
Sun Microsystems releases a revised Patent Statement on OpenDocument, making an irrevocable promise to not assert its patent rights against developers and users. |
|
Sept. 30, 2005 |
OASIS submits OpenDocument Format to ISO/IEC JTC1 (Joint Technical Committee 1 of the International Organization for Standardization and the International Electrotechnical Commission) for approval as an international standard. |
|
Oct. 1, 2005 |
Microsoft announces that it will support PDF as an output format in Office 12, the next major release of its office suite, expected to be released in the latter half of 2006. |
|
Oct. 25, 2005 |
It is revealed that CoMA Secretary of the Commonwealth William Galvin, administrator of the state’s records office, has “grave concerns” about the proposed transition to open formats. Another opponent of the move, CoMA senator Marc Pacheco, announces he will hold hearings on the proposed policy. |
|
Oct. 31, 2005 |
CoMA senator Marc Pacheco, chair of the Senate Post Audit Committee, holds hearings looking into how the Information Technology Division (ITD) arrived at its decision. During the hearing, Alan Cote, CoMA supervisor of public records, warns that proceeding with the transition to OpenDocument “may very well result in many electronic records being lost or destroyed.” The ITD’s legal counsel is asked to prepare a brief substantiating the legal basis on which the division decided it could make the new policy without consulting other branches of government or complying with public notice provisions. |
|
Nov. 2, 2005 |
A CoMA economic stimulus bill is amended at the last moment to include text establishing a new task force and stripping the Information Technology Division of the power to make policy decisions regarding document standards. |
|
Nov. 4, 2005 |
At a well-attended summit meeting in Armonk, NY, ODF supporters including Sun Microsystems, IBM, Adobe, Corel, Computer Associates, and Google pledge to eliminate accessibility problems in ODF before the CoMA January 1, 2007 deadline. |
|
Nov. 16, 2005 |
CoMA ITD legal counsel Linda Hamel submits a brief in response to the request on Oct. 31 from senator Marc Pacheco defending the process leading up to the policy implementation. |
|
Nov. 22, 2005 |
Microsoft submits the Office XML schemas to Ecma International for consideration as an international standard. The move is co-sponsored by Apple, BP, the British Library, Intel, Toshiba, and others. |
|
Nov. 23, 2005 |
Microsoft publishes on its website an irrevocable covenant that “it will not seek to enforce any of its patent claims necessary to conform to the technical specifications for the Microsoft Office 2003 XML Reference Schemas.” |
|
Nov. 23, 2005 |
CoMA Administration and Finance Secretary sends out a press release: “The Commonwealth is very pleased with Microsoft’s progress in creating an open document format. If Microsoft follows through as planned, we are optimistic that Office Open XML will meet our new standards for acceptable open formats.” |
|
Nov. 26, 2005 |
A Boston Globe article raises questions about the propriety of travel expenses by CoMA ITD chief Peter Quinn, suggesting that trips taken to promote the Commonwealth’s new document format policies may have been partially paid for by companies who stood to gain from the decision. The governor’s office promises to investigate. |
|
Dec. 9, 2005 |
Ecma International creates a technical committee to produce a formal standard for office productivity applications that is fully compatible with the Office Open XML Formats, submitted by Microsoft. |
|
Dec. 10, 2005 |
The investigation into CoMA ITD chief Peter Quinn’s travel expenses ends with his complete exoneration. |
|
Dec. 12, 2005 |
Five national library associations (the American Association of Law Libraries, the American Library Association, the Association of Research Libraries, the Medical Library Association, and the Special Libraries Association) release a statement addressed to CoMA Administration and Finance Secretary Thomas Trimarco in support of CoMA’s commitment to OpenDocument. |
|
Dec. 13, 2005 |
Microsoft publishes an FAQ on its website, clarifying some points regarding the specific terms of its covenant not to sue, including making explicit that the terms apply to the forthcoming release of Office and to the standard submitted to Ecma, once it is approved. |
|
Dec. 27, 2005 |
Peter Quinn, CIO (Chief Information Officer) for CoMA, resigns, effective January 12, 2006. |
|
Jan. 6, 2006 |
CoMA appoints Bethann Pepoli as acting chief information officer, while announcing that it has no plans to change its policies regarding adoption of open formats. |
|
Jan. 30, 2006 |
CoMA Administration and Finance Secretary Thomas Trimarco names Louis Gutierrez as chief information officer of the Information Technology Division (ITD), effective February 6, 2006. According to the press release, “Gutierrez will be responsible for overseeing the final stages of implementation of the state’s new Open Document format proposal, to go into effect in January 2007,” thus reaffirming CoMA’s commitment to the policy. |
Resources
Top Sources and Summary Information
Andy Updegrove’s ConsortiumInfo.Org Standards Blog
The Massachusetts ODF-MS XML Timeline/Resource Page from Groklaw.net
David Berlind's Blog at ZDNet
Massachusetts and OpenDocument: A Brave New World Sept. 2005
The Future Is Open: What OpenDocument Is And Why You Should Care Jan. 30, 2005
OpenDocument at Answers.com
Written and Audio Transcripts of Important Events
Eric Kriss Speech on Open Formats Jan 14, 2005 (written transcript with links to audio)
Hearings before Sen. Marc Pacheco and the Post Audit Committee, Oct. 31, 2005 (audio)
Hearings before Sen. Marc Pacheco and the Post Audit Committee, Oct. 31, 2005 (written)
MA Open Forum on the Future of Electronic Data Formats for the Commonwealth Dec. 14, 2005 (audio)
Statements on CoMA’s ODF Policy
Public Comments received by CoMA’s ITD (Aug/Sept 2005)
Comment from Five US National Library Associations Dec. 15, 2005
Peter Quinn, former CoMA ITD chief tells his story Feb. 10, 2006
CoMA/ITD Policy Documents and Statements
March 2005 draft of ETRM Information Domain
September 2005 final version of ETRM 3.5 Information Domain
FAQ on ETRM 3.5
CoMA press release following Microsoft’s Ecma submission announcement Nov. 23, 2005
Opposition to ITD’s ODF Policy
Mass. officials criticize OpenDocument decision Nov. 1, 2005
Mass. bill endangers OpenDocument decision Nov. 3, 2005
Galvin attacks software proposal Oct. 25, 2005
Microsoft Lines Up Politician Support In Mass. Format Battle Oct. 25, 2005
The Fix is in on ODF Nov. 2, 2005
Senators question file-storage shift: Blind workers say change will make it hard to do their jobs Oct. 29, 2005
Patent and License Statements, Standard Submissions and Approvals and Other Standards Information
Office 2003 XML Reference Schema Patent License January 27, 2005 version
Office 2003 XML Reference Schema Patent License December 3, 2003 version
Microsoft Covenant Regarding Office 2003 XML Reference Schemas Nov. 23, 2005
New Covenant vs. Old License for Office 2003 XML Reference Schemas
Ecma International creates TC45 to standardize Office Open XML File Formats Dec. 9, 2005
Microsoft Offers Office Document Formats to Ecma International for Open Standardization Nov. 22, 2005
Ecma International Standardization of OpenXML File Formats Frequently Asked Questions
Microsoft “Office 12” XML File Formats to Give Customers Improved Data Interoperability and Dramatically Smaller File Sizes
Adobe patent clarification notice: Reading and writing PDF files 2003
Members Approve OpenDocument as OASIS Standard May 23, 2005
Sun OpenDocument Patent Statement Sept. 29, 2005
OASIS Open Document Format for Office Applications FAQ
OASIS submits OpenDocument to ISO as standard Oct. 10, 2005
Is ODF an Open Standard? Feb. 9, 2006
Format Comparison between ODF and MS XML Nov. 25, 2005
Microsoft’s Format Covenant Fails Comparison Test with Sun’s Nov. 22, 2005
Microsoft's XML Patents and Patent-seeking
So, Could Microsoft Ever ‘Own’ XML? Feb. 13, 2004
Microsoft Files for Patents Related to XML Parsing and Word Processing
Microsoft seeks XML-related patents Jan. 23, 2004
Microsoft: XML patent moves are no big deal Jan. 26, 2004
Microsoft slammed over XML patent May 26, 2005
Microsoft defends its patents May 27, 2005
Notes
1. Erratum: Clarification note added February 28, 2006: Original text said “a major newspaper claiming improper use of state funds by a high government official.”
Calendar of Events
VRA*24 Annual Conference March 6-11, 2006 Baltimore, Maryland
The annual conference of the Visual Resources Association (VRA) of image management professionals working in educational and cultural heritage environments will include seminars, user group meetings, and workshops such as “Digital Copystand for Dummies: A Real Life Workshop for the Rest of Us!”
The 3-D’s of Preservation: Disasters, Displays, Digitization March 8-10, 2006 Paris, France
Sponsored by the Bibliothèque nationale de France and the International Federation of Library Associations and Institutions (IFLA), this symposium will address three leading concerns in preservation: disasters, displays, and digitization.
Scholarship and Libraries in Transition: A Dialogue about the Impacts of Mass Digitization Projects March 10-11, 2006 Ann Arbor, Michigan
This symposium, to be presented by University of Michigan University Library and the National Commission on Libraries and Information Science, will examine the implications and professional, social, and economic issues associated with mass digitization initiatives. A webcast of this event will be available.
Digital Preservation Training Programme March 20-24, 2006 Birmingham, England
Registration is now open for the second Digital Preservation Training Programme. This intensive, weeklong residential course will cover topics such as:
- Planning and strategy
- OAIS: initiatives and tools
- Obsolescence
- Metadata
- Costs and risk management
Museums and the Web 2006 March 22-25, 2006 Albuquerque, New Mexico
The annual conference will use a variety of presentation and networking formats to review, analyze, and discuss social, design, technological, economic, organizational, and cultural issues of the on-line presence of culture and heritage.
DSpace User Group Meeting April 20-21, 2006 Bergen, Norway
This meeting will focus on institutional implementation of the DSpace repository system, including how DSpace is used and embedded in existing systems. The format will feature long and short presentations, discussion groups, poster sessions, and tutorials. A separate Institutional Repository workshop, addressing institutional repository policy, advocacy, and open access issues, will be held just prior to the meeting on April 19, 2006.
LIFE Conference April 20, 2006 London, England
The JISC-funded LIFE project will report its findings on the costs to manage, store, and preserve digital collections at this one-day meeting. The LIFE project (Life Cycle Information for E-Literature) is a one-year study of the lifecycle management of digital collections. Although the project targeted specific digital collections at the University College London Library Services and the British Library, convenors hope the results will provide practical information for all institutions that are collecting and preserving digital material.
Digital Preservation Management: Short-Term Solutions to Long-Term Problems May 14-19, 2006 Ithaca, New York
Cornell University Library is pleased to announce continuation of the Digital Preservation Management workshop series. This limited enrollment workshop has a registration fee of $750 per participant. Registration opens March 1 for the May workshop. The keynote speaker will be Eileen Fenton, Executive Director of Portico. Additional offerings of the workshop will be held in July and October 2006.
An Expedition to European Digital Cultural Heritage: Collecting, Connecting - and Conserving? June 21-22, 2006 Salzburg, Austria
This international conference is geared to fostering discussions about the i2010 initiative for a European Digital Library. Sessions will cover collecting, connecting, and conserving digital cultural treasures and scientific information.
Announcements
Digital Preservation: Managing Digital Objects Discussion Group
The new Digital Preservation: Managing Digital Objects Discussion Group conducted its inaugural meeting at the American Library Association’s Midwinter Meeting in San Antonio, TX. Lars Meyer, Emory University, and Robin Dale, RLG, are co-chairing the group.
PRONOM PUID Scheme Released
The National Archives is pleased to announce the publication of the PRONOM Persistent Unique Identifier (PUID) Scheme, “an extensible scheme for providing persistent, unique and unambiguous identifiers for file formats held in the PRONOM technical registry.” Over 130 formats have been assigned a PUID and others will be added on an ongoing basis.
VRA Core 4.0 Beta
The Visual Resources Association (VRA) Data Standards Committee has updated its Core Categories metadata elements in order to conform to ongoing developments in data standards, data sharing, and data storage technology. VRA Core 4.0 Beta Draft has been released for review and comments. This version updates the previous version to achieve XML compliance.
Warwick Workshop Report on Digital Curation and Preservation
The final report and recommendations of the November 2005 Warwick Workshop, entitled “Digital Curation and Preservation: Defining the research agenda for the next decade,” are now available online. Topics covered in the report include curation services and technologies, drivers and barriers (policy issues), and data lifecycle management (process issues).
Global Digital Format Registry
Harvard University Library has been awarded a Mellon grant to create a Global Digital Format Registry (GDFR). The GDFR will develop and maintain information about digital file formats—information necessary for management of digital files over time. The grant is for a two-year project and will start in February 2006.
CLOCKSS
The LOCKSS Alliance has announced the start of CLOCKSS: Controlled Lots of Copies Keep Stuff Safe. This new initiative will employ the LOCKSS model (Lots of Copies Keep Stuff Safe) in a two-year project designed to test the ability of the technology to perform as a fail-safe dark archive. Materials collected in the archive will be preserved and opened for access only in the event of a disaster that renders the content “orphaned” as deemed by a joint advisory board comprised of societies, publishers, and libraries. (See the Editors’ interview in this issue for more details.)
OpenDOAR
The Directory of Open Access Repositories has announced the release of its primary listing of open access archives. The OpenDOAR project aims to build a picture of the world-wide development of open access repositories, “…working to classify these and produce information for search-providers, funding agencies and others, which will benefit scholars and researchers around the world.”