For the Record: Assessing the Impact of Archiving on the Archived
Author: Edgar Crook - National Library of Australia (ecrook@nla.gov.au)
Introduction
PANDORA, Australia’s Web Archive at the National Library of Australia (NLA), has been archiving Web-based publications for 10 years, in conjunction with participants at the Australian State Libraries and other cultural organisations, including the Australian War Memorial, National Film and Sound Archive, and the Australian Institute of Aboriginal and Torres Strait Islander Studies. There are approximately 12,000 titles within the Archive; each title may be a single discrete document or a whole government website containing thousands of pages.
Many studies and articles examining archival practice and policy have emanated from PANDORA. None, however, has attempted to gauge the effect of archiving on the archived, that is, on the publishers and their publications.
This study examines publisher behaviour and attitudes in relation to Internet archiving. Data for the study was obtained by various means. The NLA placed an online survey on their website and invited the 4920 people who had given permission for PANDORA to archive a resource between 1996 and May 2005 to complete it. The May 2005 cut-off date was used so that information would be received from those whose work had been archived for more than one year. To complement the survey, a selected range of archived material was examined to discover publication patterns pre- and post-archiving. A small number of electronic resources that were not archived by PANDORA, but which had been archived in the Library’s Whole of Domain Harvest or by the Internet Archive, were compared with items archived by PANDORA. In this way a sample of knowingly archived and unknowingly archived items was available for comparison. An analysis of published comments appearing on archived websites was also undertaken.
There are a number of Internet archiving projects currently gathering websites for preservation. Most of these—and the largest, the Internet Archive—do so in general without the express consent or knowledge of the Web publisher. As such, the Web publisher does not automatically know that a copy of their publication is in existence elsewhere and that what they are producing will potentially have a much longer life than may have been intended. However, this is not the case with PANDORA, as it is one of the few archiving projects that explicitly seeks permission from publishers before archiving and notifies them post-archiving. This study, then, queries only those knowingly archived publishers.
The PANDORA Publisher’s Survey
Material produced for the Internet is generally still not afforded the respect that is garnered by traditional print publishing: it is often not subject to peer review, it is frequently perceived to contain unreliable information when compared to print publications, and it is often hard to rank. Usage statistics are one means of defining quality and usefulness, but popularity does not always indicate quality, reliability, or link stability.
One way that an Australian Internet publication can receive permanence and recognition is by being invited to be archived in PANDORA. PANDORA is a selective archive: Web publications found within PANDORA have been chosen by staff using explicit selection criteria, giving many the impression that the items archived are in some way different from, and more significant than, those not archived. The PANDORA form letter, which is sent out to publishers when initially asking for archival permission, includes the claim that the desired item has both “lasting cultural value” and “national significance.” These compliments clearly resonate with many publishers and seemingly convey a perception of recognition and formal imprimatur. Some publishers have quoted this letter in their publications, seeking to make their audience aware of the Library’s estimation of their publication. One publisher of an online novel has even used the excerpted sentence to make it seem like a positive review.
Figure 1. Effect on public perception.
Many publishers are therefore very happy to be archived. When asked in the survey whether PANDORA archiving was worthwhile, 97% said that they thought it was; 96% also thought that archiving had been a positive thing for their publication. Conversely, the survey also showed that, prior to our first contact requesting archival permission, just over 52% of publishers had not heard of the Archive. And once aware of the Archive, only 35% had ever used it to view any other website. Interestingly, 29% of publishers also believe that it is improbable that PANDORA will preserve their publications in the long term.
Figure 2. Effect of archiving.
Figure 3. Role of PANDORA as a back-up strategy.
The majority of publishers also did not appear to rely on the Archive as a back-up of their publications, indicating that they are either unaware of the importance of back-ups or that they are in organisations that have risk management strategies in hand. On occasion, however, we have been able to provide copies of content to publishers that have suffered serious problems with their computers or networks and have lost the content on their own websites. Another service that PANDORA provides for some publishers of websites and online journals is the ability for them to point to our Archive for past issues of their publication, so that they do not have to host them themselves, presumably saving on their storage and maintenance costs.
Survey Findings
Some publishers have worried that a “light” archive, which makes material publicly accessible, would draw some readers away from their own websites, since most of the sites in the Archive are still available from the publisher. The PANDORA Archive does receive a relatively large number of hits: the usage in 2004-2005 was 5,390,459 page views. The archived sites most frequently accessed in PANDORA are almost invariably websites that are no longer available on the live Web.
Figure 4. Impact of PANDORA on website hits.
The PANDORA survey asked Web publishers if they believed that archiving had affected the number of hits to their publications. Sixty-five percent said that it had not, and of the 34% who said archiving did have an effect, 92% said that it had been positive. Caution should be taken, however, when extrapolating these results to all archives. PANDORA may lead to increased hits and usage of live websites only because it actively attempts to do this as a reciprocal gesture for publishers who participate in the Archive. Every title within PANDORA has a Title Entry Page (TEP) that serves as the first point of entry, and on this page is a link to the live site. PANDORA also uses a pop-up window to inform users that they are entering an archive and not the live site. This pop-up window appears when users enter the Archive from a link that is not within the National Library Web domain. Robots exclusions have also been used to direct search engines to deliver the TEP in their search results rather than a direct link to lower-level pages. These activities, especially the active link from the National Library of Australia, as well as our metadata on the TEP and the individual catalogue records created for each title on Libraries Australia, all raise the visibility of the live resource to a marked degree.
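The robots exclusion technique mentioned above can be illustrated with a minimal sketch. The host name, path prefixes, crawler name, and rules below are hypothetical and are not PANDORA's actual configuration; the sketch only assumes that title entry pages sit under one path prefix and archived content under another. It uses Python's standard urllib.robotparser module to check which URLs a compliant search engine crawler would be permitted to fetch.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: let crawlers index title entry pages (TEPs) only.
# Python's parser applies the first rule that matches, so Allow comes first.
ROBOTS_TXT = """\
User-agent: *
Allow: /tep/
Disallow: /
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A title entry page (hypothetical URL) versus a lower-level archived page.
tep_allowed = parser.can_fetch(
    "ExampleBot", "https://archive.example/tep/12345")
deep_allowed = parser.can_fetch(
    "ExampleBot", "https://archive.example/content/12345/page2.html")

print("TEP indexable:", tep_allowed)
print("Lower-level page indexable:", deep_allowed)
```

Under rules of this kind, search results would tend to surface the entry page, with its link to the live site, rather than a page deep inside the archived copy.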
Blogs
The National Library began a concerted effort to archive blogs in May 2005. Before that date, it had selected and archived the blogs of some notable Australian politicians and journalists. This archiving was not done because the medium was a blog, but rather because the individuals were high profile and we wanted to capture the online-only adjuncts to their traditional media output.
Beginning in mid-2005, the NLA began to archive a representative sample of blogs to document this popular means of communication and personal publication. Bloggers in particular were very enthusiastic about being archived and consequently many documented the experience on their blogs. One young man wrote an online letter for future youth; another considered the effect on their publication habits thus:
One of the weird things is that I’m going to need to resist the urge to be more self-conscious, now that I know that my words here will be preserved in this manner. I feel kind of inspired to keep this blog going and may end up trying to improve the quality and quantity of my posts, which is a good thing … I guess ;) [1]
Another was less impressed, writing that, for posterity:
...you’ll still be able to get your fix of ill-informed commentary, shilling for gambling sites and hot Asian chicks. And it’s all thanks to you, the Australian taxpayer. I would have preferred they just gave me one of those $50,000 grants, but beggars can’t be choosers.[2]
To what extent should the NLA be concerned about the effect of archiving on the documentary record? There is some justified fear that informing a content provider that what they produce will be recorded and made available for the long term may tend to influence what they produce. Thus, we might create the “observer effect” whereby things are changed merely by the fact of observing them. Happily, from a comparison of blogs both before and after archiving by PANDORA, it is possible to see that archiving does not appear to have affected the content. Bloggers, though they may consider it in the short term after initial archiving (and have commented thus), do not appear to continue self-consciously writing for a possible future audience. Instead, they concentrate on the immediate and the quotidian. From a brief textual analysis, there appeared to be no evidence that archived bloggers censor themselves any more than they did prior to being archived, a result which was confirmed by the general survey responses. The bloggers who discuss sexual, political, and personal information continue to do so, and where there is no illegality, the Library takes no role, except to restrict some archived websites and blogs to adult researchers.
Figure 5. Effect of archiving on self-censorship.
From studying a set of both archived and un-archived blogs, there also does not appear to be any change in a blog’s longevity as a result of being archived. The author’s personal circumstances, available time, and “having something to say” seem to be far greater determinants of a blog’s longevity than any effect of archiving.
E-Journals
E-journals are journals that are published only in online form; they can emanate from any source. In Australia they are most widely used by government and academic, rather than by commercial, publishers. The National Library and PANDORA are often involved at the outset of these publications, since the first task of a newly created serial publication is often to apply to the Library for an ISSN. Built into the ISSN application form is a PANDORA archival notification clause. PANDORA, therefore, is often able to archive many Australian online journals from inception.
That archiving in PANDORA significantly improves quality or maintains publishing life cannot be proved from an analysis of archived serials. The survey asked publishers whether archiving increased submissions, comments, or other contributions. Publishers mostly reported that they had experienced no change, but where there was change, it was predominantly positive. There also appeared to be no evidence that archiving prolonged publishing life.
Figure 6. Effect of archiving on submissions and comments.
Figure 7. Effect of archiving on citation rate.
There was some indication that some serials had increased citation rates as a result of being archived. Publishers’ perception of the usefulness of our creation of Persistent Uniform Resource Indicators (PIs) for their publications was very low, however. Only 14% of survey respondents believed that creating PIs for their publication had any benefit. However, we are aware that a very large number of links to the Archive are made by indexing agencies using PIs, since PANDORA has ongoing relationships with a number of indexing agencies (who furthermore actively advise us on the selection decisions in their specialist areas). The lack of knowledge of PI usage is possibly due to the fact that the links point to the Archive and not the live resource.
Commercial Websites
Most websites within the Archive exist to disseminate information, and many publishers welcome PANDORA’s role in promoting that further. Commercial websites differ in that they exist to gain revenue from users. Diffusing their customers has the potential to cost them revenue. It was interesting to note from the survey results that publishers of commercial websites say that they do not appear to be detrimentally affected by being archived.
There are three major ways in which an archive could have an economically damaging effect on a live website. The archive may take away hits, leading to less page-view-based advertising revenue and possible click-through revenue. Archives may also display materials that normally are only viewable at a cost, such as a commercial subscription online journal. The third potential problem could be user confusion, whereby a user unknowingly accesses the archive rather than the live website and attempts but fails to complete a purchase, leading to dissatisfaction with the company and loss of potential revenue for it.
PANDORA has made efforts to not interfere with commercial websites’ activities. If commercially available material is archived, there is prior negotiation with publishers to restrict access for a publisher-specified period. PANDORA also tries to make sure that users are aware that they are in the Archive and minimizes sales confusion by not archiving or allowing any transaction pages or functions and by using re-directs to point to the live website.
Consequently, PANDORA has limited effects on commercial activity. The survey results showed that this policy seems to have worked since only 1% of commercial publishers believe archiving has had a negative impact.
Figure 8. Effect of archiving on revenue.
Conclusion
This study does not seek to give an opinion on how publishers perceive all Internet archiving projects. If, however, a known, openly searchable archive, such as PANDORA, has few problems with publishers, then a less searchable (deep Web) whole domain or larger archive should present even fewer problems for publishers.
The National Library has a statutory duty to collect and preserve Australia’s documentary history, and born digital publications are no exception. To successfully create an archive that encompasses the broadest range of publications requires the ongoing consent of publishers. As long as there are archived copies and live websites, it will remain the responsibility of archives to make sure that their activity does not adversely affect online publications, in terms of both their content and their commercial value. In the long term, many publishers’ websites will no longer be available outside of the Archive. However, the archived publications will still be protected by copyright for years to come, and, therefore, the Library will need to continue to have the publisher’s consent to make material accessible.
The results of the study show that PANDORA archiving has thus far not had a detrimental effect on publications, and is in fact mostly benign and in some cases beneficial. It is to be hoped that the knowledge that Internet archiving does not necessitate any conflict between archivists and publishers will assist in guiding future negotiations.
What options are available to a small scale collecting repository when the core documentation in its primary subject area is no longer created in traditionally manageable formats? How well do traditional methods for appraising institutional records, which were developed in the context of stable, structured organizations, adapt to increasingly distributed, dynamic organizations whose records are primarily born-digital? For a collecting repository whose subject area is high technology, the problem feels particularly acute: the irony of trying to capture adequate documentation of developments in information technology in paper only is ever present.
The Project
These questions were at the core of a collaborative project, funded by the National Historical Publications and Records Commission and administered by the University of Minnesota’s Charles Babbage Institute (CBI) Center for the History of Information Technology between 2003 and 2005.[1]
In this article, we describe a few of the methods, findings, and ideas for further exploration generated during “Documenting Internet2: A Collaborative Model for Developing Electronic Records Capacities in the Small Archival Repository.”
The immediate context of the project was an increasing level of concern about the future of archival collecting at CBI. Established in 1979, CBI is arguably the world’s pre-eminent repository of research material on the history of information technology. CBI’s collection strengths follow the development of computing, from the calculators of the Burroughs Corporation in the nineteenth century, through the first electronic digital computers after World War II, the emergence of software products distinct from hardware, the growth of the computer and software industries, and the advent of networking. The collections include personal papers, institutional records of various kinds, oral histories, product literature, and rare publications. By far the largest collections are corporate records and records of professional organizations, all of which are in traditional formats. Over the years, CBI archivists had, with some regret, declined offers of digital materials. These decisions were by and large not too painful: most of the materials offered were of uncertain research value, or raised proprietary issues, or were in obsolete formats that would have required prohibitively expensive reformatting before we could even begin to address the larger issues. Other collections offered to CBI were from defunct companies or organizations that could offer no assistance to help us understand the content of inadequately labeled tapes and discs. Finally, CBI simply didn’t have the infrastructure to support basic preservation of electronic records, much less to commit to providing long-term access. This state had become untenable, though, by the late 1990s, as the documentation produced in CBI’s subject area was increasingly created in digital forms with no paper equivalent and as collections of serious research interest became available.
When project planning began in late 2002, few institutions had implemented successful electronic records preservation programs, and those that had were either large corporations or governmental entities with institutional archives and records programs. These programs were not scalable to a collecting repository such as CBI. Another key difference was that, as a collecting repository, CBI can only inherit records, not influence their production, as is often suggested in the archival literature. Finally, archivists had yet to become seriously involved in the fledgling institutional repository movement. As a result there were few models for smaller collecting repositories seeking to expand their scope into the digital realm.
Overarching project goals were twofold:
to explore the application of traditional archival appraisal methods in a digital context through hands-on experimentation with techniques and tools
to begin an assessment of institutional capacity for building a sustainable digital component into the CBI archives program, within the context of the University of Minnesota Libraries
The project team agreed on some basic principles from the start. First, this was a planning, not an implementation grant, meaning that we would explore issues and conduct experiments, but we didn’t expect to have accessioned a digital collection by the project’s conclusion. Second, our orientation was fundamentally practical, not theoretical: we were aiming for adequacy, not perfection, and we wanted to get our hands dirty. Finally, for the archivists on the project team, this was an opportunity to learn about techniques and approaches from other fields and evaluate their application in an archival setting. We were prepared to be flexible in our approach.
We hoped the project would lead to some practical application at CBI, as well as to guidelines or best practices that could be useful in other settings.
The Partners
We found the ideal object of our documentation in Internet2. Headquartered in Ann Arbor, Michigan, Internet2 is a research and development consortium of universities working in partnership with government and industry to develop and deploy advanced network applications and technologies. As an organization at once developing and deploying new technology, it fit squarely within CBI’s collecting scope. As a collaborative and distributed organization whose modes of communication were primarily born digital, it provided a promising vehicle for our exploration. Other factors contributed to the appeal of working with Internet2. Because it is a publicly supported organization, concerns over access to proprietary records would be minimal. As a living, growing organization, it would offer us the opportunity to interact directly with records creators to better understand their culture and practices. The timing was good, too: Internet2 was approaching its tenth anniversary, and Director Doug Van Houweling and Chief of Staff Barbara Nanzig were beginning to think about how best to capture and manage their organizational records. Nanzig generously provided us with access to records, access to key staff members, and on-site workspace, and even took the time to travel to multi-day advisory board meetings and archival professional conferences.
The project team and board of advisors, like Internet2 members, were geographically distributed. One valuable lesson from previous NHPRC projects was that a collaborative approach and a wide range of expertise would be required. We deliberately sought partners from a variety of fields, with expertise in such areas as digital library development, electronic records program development, library information technology, historical research, documentation strategies and functional analysis, institutional archives, organizational culture and communication, and records management. Beth Kaplan and Carrie Seib led the project from the CBI end; Margaret Hedstrom and Dharma Akmon of the University of Michigan School of Information led the on-site work and experiments. Eric Celeste at the University of Minnesota Libraries worked with Akmon and Hedstrom on the experiments from the Minnesota end. The advisory board provided input at two two-day meetings during the course of the project, and each member contributed substantially.[2]
Project Methods and Activities
Archival appraisal is complex and not uncontroversial. Several theories of institutional archival appraisal underlie its practice. Reduced to its simplest level, one school of thought would have the archivist develop a thorough understanding of the organization and its history before even thinking about collecting activities. A related approach would begin with an intensive analysis of the organization’s core functions, before identifying the ways in which documentation is produced in the carrying out of those functions. Other methods would advise the archivist to jump right in with a records survey and learn about the organization later. Still another would be to let collecting be guided by the organization’s hierarchical structure. While we had our biases, we didn’t want to rule out any possibilities.
Akmon began the appraisal process with intensive research on Internet2’s history, structure, and functions. She prepared a detailed setting description that would provide much of the contextual information we needed in order to move forward and to which we would return regularly throughout the project.[3] In May of 2004, Akmon began full-time, on-site work at Internet2, with the goal of compiling information about records and record-keeping practices at the organization. Akmon did conduct a traditional records survey, but by far the most valuable source of information about records came from interviews with key staff members. The interviews provided a wealth of information on the organization’s core functions and how those translated into records, as well as insights into records creators’ record-keeping needs, practices, assumptions, and attitudes. The value of this information suggests that “soft skills,” in this case the ability to interact with records creators on their terms, may be increasingly important for archivists working with modern organizations.
One widely held assumption on the part of Internet2 leadership was that the organization’s core records were routinely captured and made accessible through two central electronic document management systems: Documentum’s eRoom and the Internet2 Document Library. A significant turning point in the project occurred when we realized this was not the case. Despite the availability, ease of use, and apparent convenience of these tools, and despite a high level of institutional commitment and technological savvy among staff members, most people relied on personal computers and the Internet2 Web space for keeping and sharing the documents they viewed as important. As we reviewed information from the surveys and interviews, we realized that in reality, the Internet2 public website functioned as the only centralized, shared, repository of core organizational content.
Experiments and Findings
Why were the document management systems in place at Internet2 underutilized? The Document Library had been designed specifically for use at Internet2 to serve as “an authoritative source of documents and other deliverables produced by Internet2 Working Groups, Advisory Groups, or Projects.”[4]
The Document Library is publicly accessible online via the Internet2 website. We inventoried its contents in February 2005 and conducted a detailed review of its features including its structure and use: the submissions process, existing guidelines, metadata requirements, and other features that could impact submissions.
Our first experiment addressed the theme of records creators’ behaviors and tested the submission model of collecting. In a perfect world, high-level mandates and compliance with records management would result in the routine capture of appropriate documentation. Given the opportunity to work with records creators at Internet2, we wanted to test staff responses to record-keeping guidelines and to learn more about why the Document Library and the eRoom did not turn out to comprise a cache of critical documentation as had been assumed.
We discovered several reasons for this. Interviews with staff revealed that the Document Library is in fact a hyperlinked citation list. Entries in the Document Library serve as pointers to documents in staff’s personal Internet2 server space. Staff members who do submit material to the library also have the ability to move or remove the documents without warning and without leaving a documentary trail. Because of this, and despite its potential, the Internet2 Document Library clearly was not living up to its intended purpose.
But what if upper management mandated submission of content to the Document Library? We wanted to know more about the depth of records creators’ willingness (or reluctance) to add another step to their daily routines and what factors might affect that level of involvement. We wanted to see how employees would respond to a very clear management directive to submit documents to the Document Library, accompanied by the rationale that, essentially, it was for the greater good of the institution. We asked a senior staff member to send a message to managers in two areas of the organization reiterating the purpose of the Document Library and encouraging staff members to submit the documents and deliverables of their areas and working groups “so that the Document Library can become the single, authoritative source for working papers, technical reports, proposals and recommendations, and any other important project documentation.”
We monitored additions to the Document Library for approximately a month after the message was sent, and during that time not a single additional document was submitted. In fact, since its implementation in fall 2003, only twenty documents had been submitted to the Document Library, and interviews with staff revealed an important reason. Submitting content to the Document Library involves filling in a fairly extensive list of metadata fields based upon Dublin Core guidelines, a time-consuming task with no perceived direct value to the records creators. In other words, there was no real incentive for compliance.
With this less than promising response to the “submission” model, we determined to focus next on a “capture” model that would entail little or no extra work on the part of the records creators. We knew from the interviews that staff members gravitated toward the Internet2 website for finding and sharing documents. Our next pilot project was a crawl of the website in order to capture those documents.[5] We conducted three crawls over a period of two months.
Once the crawl was processed, we had an online mirror of the Internet2 website and one that could be searched using existing desktop software.[6] Although the number of documents captured varied from one area in the organization to another (as would be the case in the paper world, as well), a surprising volume of important documentation was captured. We were able to create our own archived instance of the site that could be surfed and searched as though it were still live. The crawl stored each page as we found it while also preserving the connections inherent in Web content (the links and hierarchy), or, in archival terms, the “original order.” The idea of original order is that a document’s meaning is related to and to some extent derived from the context in which it was created or filed by the creator. Archivists conducting records surveys in the paper world are careful to capture even the most mundane of contextual information (specific location, labels, post-its, and notes on drawers) because of the meaning these bits of information can convey.[7] The ability to capture the documents in their original order while also enhancing search capabilities was one of the most exciting aspects of the project.
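To make concrete how a crawl can keep both the pages and the connections between them, here is a minimal sketch. It is not the tool used in the project: the site is a hypothetical in-memory dictionary standing in for live HTTP responses, and the function and variable names are illustrative assumptions. The crawl stores each page body as found and records, for every page, the list of links it contains, which is a rough stand-in for the “original order” described above.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag

# Hypothetical in-memory site; a real crawl would fetch these over HTTP.
SITE = {
    "https://example.org/":
        '<a href="/about.html">About</a> <a href="docs/">Docs</a>',
    "https://example.org/about.html": '<a href="/">Home</a>',
    "https://example.org/docs/":
        '<a href="report.pdf">Report</a> <a href="https://other.example/">Ext</a>',
    "https://example.org/docs/report.pdf": "",
}


class LinkExtractor(HTMLParser):
    """Collect the href value of every anchor tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl(start_url, fetch, scope_prefix):
    """Breadth-first crawl that keeps page bodies and the link graph."""
    pages = {}   # url -> body exactly as fetched
    links = {}   # url -> absolute URLs the page links to, in page order
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in pages:
            continue
        body = fetch(url)
        if body is None:          # nothing retrievable at this URL
            continue
        pages[url] = body
        extractor = LinkExtractor()
        extractor.feed(body)
        targets = []
        for href in extractor.hrefs:
            target, _fragment = urldefrag(urljoin(url, href))
            targets.append(target)
            # Follow only links that stay inside the site being captured.
            if target.startswith(scope_prefix) and target not in pages:
                queue.append(target)
        links[url] = targets
    return pages, links


pages, links = crawl("https://example.org/", SITE.get, "https://example.org/")
for source, targets in links.items():
    print(source, "->", targets)
```

Because the link graph is stored alongside the content, out-of-scope links are still recorded as context even though their targets are not captured.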
Some notable shortcomings were also immediately apparent. The crawl results, for example, do not provide the date of file creation or last modification, only the date the material was crawled, which is a significant drawback. We also found that certain uses of scripts can render a Web page uncrawlable.[8]
While one particularly appealing aspect of the crawl method is that it requires minimal disruption of staff workflow and no time wasted trying to enforce record-keeping requirements, there was a learning curve, and the crawl required significant intervention. Our first crawl got hung up on Web-based calendars and its parameters had to be redefined. Further, on top of time spent on our side to administer the crawls and process the results, the richness of our crawl results was a function of Internet2’s willingness to work with us. This enabled us to capture significant documentation missing from our first crawl.
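The calendar problem is a common one: a Web-based calendar can generate an effectively endless series of “next month” pages. As a hedged illustration of the kind of parameter that might be redefined, the sketch below filters candidate URLs before they are crawled. The exclusion patterns, the depth limit, and the function names are assumptions made for this example and do not describe the project's actual crawl settings.

```python
import re
from urllib.parse import urlparse

# Hypothetical exclusion rules for calendar-style URLs.
EXCLUDE_PATTERNS = [
    re.compile(r"/calendar/", re.IGNORECASE),
    re.compile(r"[?&](month|year|date)=", re.IGNORECASE),
]

# Hypothetical cap on path depth, a second guard against endless URL chains.
MAX_DEPTH = 8


def path_depth(url: str) -> int:
    """Count the non-empty path segments in a URL."""
    return len([seg for seg in urlparse(url).path.split("/") if seg])


def should_crawl(url: str) -> bool:
    """Return False for URLs that look like crawler traps."""
    if path_depth(url) > MAX_DEPTH:
        return False
    return not any(pattern.search(url) for pattern in EXCLUDE_PATTERNS)


candidates = [
    "https://example.org/docs/report.html",
    "https://example.org/calendar/2005/03/",
    "https://example.org/events?month=4&year=2031",
]
for url in candidates:
    print(url, "->", "crawl" if should_crawl(url) else "skip")
```

Rules like these trade completeness for a crawl that finishes, so any excluded area would need to be reviewed by hand to confirm that no significant documentation is being missed.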
We were able to conduct the Web crawl in part because so much of the Internet2 website is publicly accessible. We believe our results would have been significantly enhanced had we been able to capture data from staff members’ personal computers. But capturing personal computer data would have raised a host of problems: many staff members interviewed acknowledged that they keep personal, non-work-related documents on their work computers, making privacy and the relevance of documents captured a concern. Further, personal computers are not only used as document storage systems, but also as work spaces, meaning that documents captured are not necessarily complete. In the context of this project, we decided that arranging to capture PC data would not have been worth the effort.
As the cost of storage diminishes, and the availability of increasingly sophisticated tools grows, we believe that the data capture model will be an important strategy for institutional archives.
Parting Thoughts
Because organizations create and manage their records differently, any hope of preserving those records, or the “order” in which they are kept, will require a basic rethinking of how archivists actually get at the records they hope to collect. The theory and appraisal models discussed above provided an excellent starting point, but in the end, we needed to step outside of the box in order to actually get at the records. This also demonstrates the value of a diverse group of project participants and advisors. Had we not considered the suggestion of a librarian on the advisory board that we “save everything,” we wouldn’t have come to the realization that capturing “everything” through the crawl would not only gather content but preserve connections and context.[9]
Other, more conventional, organizations will have a much different record keeping landscape than that of Internet2, which might allow for appraisal at a more granular level. Some collections will be so large, distributed, and unmanaged that they can realistically be “processed” on only the highest level. Appraisal in this context may mean deciding to do the Web crawl but forgo the email or the information on people’s personal computers. It may mean selecting which desktops to capture and which document file types are the most important in adequately capturing the documentary record of the organization. The project affirmed our bias that a thorough evaluation of the record keeping landscape in a particular organization is more critical than ever. The project also highlighted an unexpected level of similarity between decision making in the digital world and in the paper world. It may be that as archivists, we have just become comfortable with the spectrum of options when working with paper, and more willing to accept imperfection. Just as in the paper world, archivists need to work with the organization to identify options, determining what is possible and what is practical, and making decisions based on that information. In the digital context, these decisions are made more consciously and are more visible and open to scrutiny.
If we began the project hoping to create some best practices and guidelines that would help us to begin to build our program and contribute something to others, we concluded it with a deepened respect for the adage that “no one size fits all.” At the same time, our findings—the successes, the problems, the questions that arose—will be useful for other projects and at other institutions. Internet2 is perhaps an extreme example of a modern high tech organization, but in fact we are already drawing on our findings in other contexts.[10]
Notes:
[1] National Historical Publications and Records Commission grant number 2004-036. We are grateful to the NHPRC for making this project possible.
[2] Arthur Norberg, CBI; Wendy Lougee, University of Minnesota Libraries; Bob Horton, Minnesota Historical Society; Joe Anderson, American Institute of Physics; Phil Bantin, Indiana University; and Barbara Nanzig, Internet2.
[3] All project documents referenced here are available online at www.cbi.umn.edu/documentinginternet2/
[4] http://docs.internet2.edu/
[5] The primary technology used for the crawl was Heritrix (a Java-based crawler from the Internet Archive). Perl, JavaScript, PHP (along with a Web server that can serve the PHP), MySQL, and the assistance of several University of Michigan graduate students were used to process the results of the crawl. Detailed information is accessible at http://wiki.lib.umn.edu/DI2/HowToCrawl. The Heritrix crawler now also powers the Internet Archive’s ArchiveIt service, which simplifies focused crawling for institutions.
[6] We used Apple’s Spotlight tool for this task.
[7] For example, “the first three drawers of the tan file cabinets against east wall in the hallway contain background files used to support patent litigation for software product A.”
[8] We used a JavaScript URL rewriting method, and while this is very functional for simple Web pages, it can fail when the crawler encounters Web pages that use scripts to construct links on the fly.
[9] Of course, lurking below the surface of all of this (and outside the scope of our project) is the continuing problem of longterm preservation. Our crawl summary revealed thousands of PowerPoint, PNG graphics, XML, and PDF documents, hundreds of MS Word, RealAudio, ZIP files, and dozens of other movie formats; in all, about two dozen file formats. We offer no new solution to these thorny issues, but we do have a significant body of content that we can experiment with. In the meantime, we just note here that we are committed to “bit preservation” and to capture of content in platform independent, open source contexts whenever possible, without any real idea how to maintain access to these formats over time.
[10] The lessons learned from the Internet2 project have already been applied not only at CBI but in the University Archives and Minnesota’s University Digital Conservancy initiative, which serves as the digital arm of the University Archives as well as a repository for faculty research.
At Cornell University Library’s Digital Preservation Management Workshop, we are often asked how other industries, such as corporate business and health care, are dealing with management of digital assets. Information about their practices is, in general, not as transparent as are preservation and archiving activities in the cultural heritage realm. This issue’s Highlighted Web Site, EContent, is dedicated to distributing news and perspectives on electronic content for “executives and professionals involved in content creation, management, acquisition, organization, and distribution in both commercial and enterprise environments.”
The site features categorically-organized news snippets, a newsletter, white papers, interesting editorial columns, and the EContent 100, an annotated list of “companies that matter most in the digital content industry.” The advertisements and sponsored links are unobtrusive and one does not need to register to access most of the website features.
Another site, the Association for Information and Image Management’s (AIIM) Enterprise Content Management site, also offers business-based perspectives but requires registration in order to view much of the content. The registration is free, but be sure to read the privacy policies!
FAQ
Orphan Works and Section 108 Updates
Author: Peter Hirtle - Cornell University Library (pbh6@cornell.edu)
What are some of the recent developments with regard to orphan works and section 108 that may affect digitization projects and digital preservation efforts?
This FAQ is answered by Peter Hirtle. Hirtle is the Technology Strategist for Cornell University Library’s Instruction Research and Information Services unit as well as the Library’s Intellectual Property Officer.
Orphan Works Update
In the 15 April 2005 issue of RLG DigiNews, I reported on a Copyright Office study on the “orphan works” problem. An orphan work is one whose current copyright status either cannot be identified (because, for example, it is an anonymous manuscript or uncredited photograph) or for which the reputed copyright holder cannot be contacted in order to seek permission to reproduce or distribute. Any institution that reproduces or distributes copies of the original item without permission might be liable for copyright infringement should the legitimate copyright holder step forward at a later date. The concern is that some institutions would be reluctant to digitize and/or preserve such an item in light of the potential risk.
There has been substantial progress on the orphan works issue since last year. The Copyright Office issued its final report on the matter on 31 January 2006. The report is one of the best produced by the Copyright Office. It does a tremendous job of defining the orphan works problem, discussing why it is a problem, and finally proposing a reasonable set of recommendations as well as draft legislation. The heart of the recommendations can be summarized simply:
Users are expected to conduct a reasonably diligent investigation to locate the copyright owner.
If such an investigation is done, and a copyright owner later surfaces, the user will only have to pay reasonable compensation for the use of the work—not the horrific penalties that can be associated with copyright infringement.
Libraries, museums, and other non-commercial users could avoid even those fees if they stopped using the item immediately.
The report represents a reasonable compromise approach to the orphan works issue. Some of the commentators had argued for more concrete procedures, suggesting that the definition of a “reasonable investigation” is the type of issue that could still be litigated in court and did not provide the assurances they needed. The final report recognized, however, that there can be no “one size fits all” approach to copyright investigation; an undated photograph given to you by your great aunt might require a different level of investigation than would a published item that had been commercially distributed. And it left open the possibility that user communities could develop their own best practices, depending on the nature of the material in question. On other issues of concern to the cultural heritage community, the report failed to propose action. For example, the report declined to include in the definition of orphan works those works whose owners refused to respond to permission requests. Many commentators wanted registries of copyright owners or copyrighted works, but the report decided not to include either a mandatory or voluntary registry of works in its recommendations because it might run afoul of international treaties or be unduly onerous to copyright owners. Yet in spite of these limitations, the report as a whole presented real progress in the effort to address the issue of orphan works.
The Judiciary Committee in the House of Representatives held a hearing on the report on 8 March 2006; the Senate Committee on the Judiciary Subcommittee on Intellectual Property had its own hearing on 6 April 2006. The only opposition to the report initially came from freelance photographers and graphic specialists, who worried that their work could too easily be labeled “orphan” and that the current remedy for misappropriation, namely a lawsuit, is too expensive.
A series of negotiations between the interested parties led to the introduction of legislation in the House, H.R. 5439, the Orphan Works Act of 2006; no corresponding bill has been introduced in the Senate. The proposed legislation is far from perfect. One could, for example, read it to suggest that one had to pay the Copyright Office to conduct a search of its paper records for every item that one would want to digitize (rather than just relying on online databases). In addition, while the exemption for reproduction by educational and charitable institutions is included in the law, it would not apply if the institution “has earned proceeds directly attributable to the infringement.” It is unclear if the reproduction and handling costs associated with making a copy for a patron would be considered to be “proceeds.”
In spite of these concerns, the Orphan Works Act of 2006 represents a real step forward. If it were to pass, libraries and museums could proceed with many digitization and preservation projects, knowing that all that is required of them is a reasonable investigation to locate the copyright owner. As of 7 August 2006, the bill is before the full House Judiciary Committee, but it still faces strong opposition from freelance photographers and textile manufacturers, and it is unclear if it will emerge for a full House vote.
Section 108 Study Group
Section 108 of the Copyright Act permits libraries and archives to make a limited number of copies of copyrighted works in support of scholarly research and preservation. Among the things that Section 108 allows libraries and archives to do are:
make some copies of unpublished copyrighted works for preservation
under limited conditions, make a replacement copy of a published item that is damaged, deteriorating, lost, stolen, or in an obsolete format
make copies of brief sections of textual published and unpublished works at the request of users
record broadcast news programs and lend them to researchers
participate in interlibrary loan programs
In addition, the section absolves the library of liability for most copying done by patrons using photocopiers.
Section 108 was developed with printed materials in mind, but it was not clear to the Copyright Office or the Library of Congress that the exemptions worked as well in the digital age. As a consequence, early in 2005 the Library of Congress appointed the Section 108 Study Group to prepare findings and make recommendations for possible alterations to the law that reflect current technologies. Nineteen committee members, representing a broad range of interests, sectors, and expertise, have been meeting bi-monthly in order to develop recommendations on how to revise the copyright law in a manner that best serves the national interest while at the same time ensuring an appropriate balance among the interests of creators and other copyright holders and the patrons of libraries and archives. (Full disclosure: I am serving as one of the members of the Study Group.)
During its first year, the Study Group has spent much of its time discussing the preservation exceptions in Section 108. In March 2006, the Study Group held public roundtables in Los Angeles and Washington, D.C. on issues that had been under discussion. In preparation for those roundtables, a detailed background document describing the issues of concern to the Group was prepared. Four major topics were discussed at the roundtables:
Eligibility for Section 108 Exceptions
Currently there is no definition of what constitutes a library or archives in the law. One suggestion is to limit the exemptions to those organizations with a non-profit mission or engaged in non-commercial activity. The issue of whether the exemption should extend to virtual libraries and archives or should include museums was also raised. Finally, it was noted that the exemption does not extend to outsourced activities, even though libraries have long relied on vendors to conduct much of their preservation work. Whether the exemptions should extend to the agents of libraries and archives, and under what conditions, was therefore also discussed.
Making copies for preservation and replacement purposes
The current law allows libraries and archives to make up to three preservation copies of unpublished works and up to three “replacement” copies of published works when certain stringent requirements are met. There is a concern that in the digital world, limiting the legal copies to three makes little sense, but it is unclear how best to modify the law. It was also recognized that allowing replacement copies to be made only when a work is “damaged, deteriorating, lost, stolen, or in an obsolete format,” as is specified in the current law, may not work for digital works. The inherent fragility of digital information may mean the replacement copying needs to be done sooner, and so the group has discussed the idea of adding concepts such as “unstable” or “fragile” to the law. Finally, the current law limits access to digital copies of preserved works to the “premises” of the library or archives. One issue under discussion has been when, if ever, off-site access to preserved works could be provided.
Creation of a new “preservation only” exception
Given the fragility of digital information, some have proposed that there be a general preservation exemption in the law that allows libraries and archives to preserve information whenever they want but limits access to the preserved information until certain trigger events occur. An important area of discussion has been whether this could happen while still preserving the interests of the rights holders. One suggestion under discussion has been to limit this exemption to a defined subset of all libraries and archives.
Preservation of network-accessible resources
An issue of special concern to the group has been the preservation of network-accessible resources. More and more of our culture is found on the Web, and yet because libraries and archives do not acquire copies of websites, it is unclear whether they would have the right to preserve them. The Study Group has been considering, therefore, whether there should be a special exception to permit the online capture and preservation by libraries and archives of websites or other online content and what limitations should apply to this exception.
Future work
Section 108 includes more than preservation issues; in the next few months, the Study Group hopes to examine issues surrounding copies made for patrons, both directly and via interlibrary loan, as well as areas that are currently outside of the scope of Section 108, such as electronic reserves and the relationship of the section to negotiated license terms. While it was initially hoped that a final report would be ready in 2006, the number and complexity of the issues have forced the discussions into 2007. The Study Group’s report is likely to be followed by a series of legislative suggestions and public hearings before any final legislation is developed and passed.
The DELOS Digital Preservation Cluster will hold a two-day seminar for digital preservation specialists and practitioners at the National Library of Estonia. The seminar will focus on the Cluster’s initial work, including research results and a testbed framework for digital preservation experiments. Other efforts will be presented as well, including modeling preservation activity and invited presentations on library (National Libraries of New Zealand and Estonia) and archive (National Archives of Estonia) perspectives.
This National Information Standards Organization workshop will feature best practices, case studies, and question-and-answer sessions about managing complex collections.
ICDAT 2006 will focus on bridging the technology and content in digital archives. Invited speakers and technical presentations will cover topics such as:
The theme of this year’s Library & Information Technology Association Forum will be “NetVille in Nashville: Web Services as Library Services,” and the forum includes keynote and platform sessions covering digital assets management topics. The small conference format highlights networking opportunities, speaker and poster sessions, and vendor exhibitions.
The theme of the 2006 annual meeting of the American Society for Information Science & Technology will be “Information Realities: Shaping the Digital Future for All.”
The preliminary program for MCN’s thirty-fourth annual conference, “Access to Assets: Return on Investment,” is now available on the conference website. The meeting includes workshops, platform sessions, and a keynote presentation from Kenneth Hamma, Executive Director for Digital Policy and Initiatives at the J. Paul Getty Trust.
The Second International Digital Curation Conference, “Digital Data Curation in Practice,” will feature keynote speeches by Dr Hans F. Hoffmann (CERN) and Clifford Lynch (CNI) and platform sessions presented by an international slate of speakers.
In accordance with ISO and Consultative Committee for Space Data Systems (CCSDS) procedures, the Reference Model for an Open Archival Information System (OAIS) has entered its five-year review period. In order to reaffirm, modify, or withdraw the existing standard, CCSDS is soliciting recommendations for updates that will reduce ambiguities, add missing concepts, or strengthen weak ones. Specific areas for recommendations are:
Updates needed for clarification
Updates to add missing concepts or strengthen weak concepts
Identification of any outdated material
Comments must be received by 30 October, 2006 and should be submitted to: OAIS-support@delight.gsfc.nasa.gov.
Presentations are now available from the Coalition for Networked Information (CNI) and the UK’s Joint Information Systems Committee (JISC) conference held in York, England on July 6-7, 2006. Topics covered at the conference included access and preservation, user characteristics, open access, e-theses, digitization projects, and resource discovery.
The Digital Curation Centre (DCC) has created a short, online submission form for adding news and announcements about tools, resources, and documents to their website. The DCC hopes to “help promote the range of ongoing, international activity in the fields of digital curation and preservation…”
The Sydney eScholarship initiative is a new program that aims to integrate digital library, digital repository, publication, and associated advisory and business services for the University of Sydney, and it includes “…a business and strategic position from which to address issues around faculty partnerships; infrastructure; content creation, management and preservation; forms of publication and re-purposing; and sustainable data management.”
The Third International Conference on the Preservation of Digital Objects (iPRES 2006), “Words to Deeds: Collaboration in the Realm of Digital Preservation” will accept early registrations, at a discounted rate, until September 1, 2006.
Publishing Information
RLG DigiNews (ISSN 1093-5371) is a Web-based newsletter conceived by the RLG preservation community and developed to serve a broad readership around the world. It is produced by staff in the Department of Research, Cornell University Library, in consultation with RLG Programs, OCLC Office of Programs and Research, and is published six times a year at www.rlg.org.
Materials in RLG DigiNews are subject to copyright and other proprietary rights. Permission is hereby given to use material found here for research purposes or private study. When citing RLG DigiNews, include the article title and author referenced plus “RLG DigiNews”. Any uses other than for research or private study require written permission from RLG and/or the author of the article. To receive this, and prior to using RLG DigiNews contents in any presentations or materials you share with others, please contact Robert Bollander (bolander@oclc.org), OCLC Office of Programs and Research.
Please send comments and questions about this or other issues to the RLG DigiNews editors.
Co-Editors: Anne R. Kenney and Nancy Y. McGovern; Associate Editor: Robin Dale (RLG Programs, OCLC); FAQ Editor: Richard Entlich; Contributor & Production: Ellie Buckley; Advisor: Peter Hirtle.
All links in this issue were confirmed accurate as of August 16, 2006.