RLG
 Feature Article 1  

Editor's Interview with Clifford A. Lynch



Clifford A. Lynch clifford@cni.org
Executive Director, Coalition for Networked Information (CNI)

CNI has long had an interest in digitization and digital preservation. How has that changed in the last seven years during your tenure as Executive Director? How do digitization and digital preservation factor into CNI’s agenda?

Basically, the issue of digitization is a pretty fundamental one, to the extent that one is concerned about material to support scholarship. Huge amounts of material that are critical to teaching, learning, and research still exist only in print, on paper, or in other physical formats that are largely inaccessible. We have certainly seen a lot of institutions doing work in this area. At Cornell, to cite just one example that I know you are familiar with, there has been lots of work on digitizing collections, producing digital content that can be exploited to advance scholarship and learning. Since the beginning, though, most digitization has been motivated, at least in very large part, by a desire to improve access. There has been a debate about whether digitization constitutes an actual preservation strategy, as opposed to simply a way to produce a surrogate that can reduce handling of fragile originals. I believe that yes, digitization is an actual direct preservation strategy that is every bit as valid and sensible as, for example, microfilming, but this has been controversial and the preservation world is just getting comfortable with this. A recent ARL policy statement supporting digitization as a legitimate reformatting strategy is, I think, a milestone in acceptance. Digitization enables easy replication of preservation surrogates or copies in multiple locations, making them more survivable than print or film versions.

Digital preservation – by which I will mean a focus on preserving digital objects, as opposed to digitization to preserve originally physical objects – is another fundamental CNI interest for as long as I’ve been associated with it. As discourse goes digital, we’d better figure out how to ensure a lasting record or we run the risk of losing the continuity of the discourse. It’s as basic as that, and it’s significant for all aspects and spheres of discourse: personal, commercial, scholarly, public.

As for the CNI agenda and priorities for the next few years, I think that the overarching objective has to be putting effective large-scale systems in place to actually carry out digital preservation activities. This means attention to social, economic, legal, and organizational as well as technical aspects of the digital preservation problem. Over the years, there has been a lot of focus on a magic bullet technology for digital preservation. Personally, I don’t believe one exists. We’ve seen various proposals for magic bullets, e.g., inscribing information on nickel-based storage that can be read in 10,000 years. That doesn’t get us very far when we are talking about complex interactions and enormous databases. Emulation isn’t a magic bullet either – though I think it’s a useful tool in the toolbox of digital preservation techniques and technologies.

In the last few years initiatives like the Library of Congress-led NDIIPP have started to focus on economics and ongoing roles and responsibilities issues. These are critical for any real progress – probably the most important parts. We need technology and best practices. What’s been missing is all of our major cultural heritage institutions stepping up to say, “This is our responsibility. Yes, we have limited resources; yes, this is difficult, but we need to do it.”

We’ve had leadership from the Internet Archive in this area since the mid-1990s. The Internet Archive represents a wake-up call and a bit of a rebuke to established cultural heritage institutions. National libraries are now stepping up to provide leadership. We’re seeing progress in European legislation for national libraries in recognizing their role in stewardship of digital materials in cultural heritage, e.g., in Scandinavia and the UK. I’ve already mentioned the LC NDIIPP. And the broader research library community, at least in the US, Canada, and a few other nations, is developing strategies such as institutional repositories for preserving scholarly information – though I think research libraries are going to have to get more involved in the non-scholarly cultural materials that are the raw materials of future scholarship, which is going to be a messy, expensive, and thankless but necessary task; we can’t just leave this to the national libraries.

We have a pair of “quotable Cliff” questions for you. In the Feb 2003 ARL Bimonthly Report you said, “The thinking about digital preservation over the past five years has advanced to the point where the needs are widely recognized and well defined, the technical approaches at least superficially mapped out, and the need for action is now clear.” What is your current thinking about that quote?

For what it’s worth, I still believe it, though in hindsight I was perhaps a bit overly optimistic about how widely the needs for digital preservation are recognized. It’s getting better every month and every year, but there’s still a lot of education to do.

In my job, I talk to a lot of senior administrators in higher education institutions. One of the issues they’re concerned about is developing strategies for deploying institutional repositories, which includes making the case for digital preservation to the academic community.

I think that within libraries, archives, and related institutions there is a pretty good understanding that digital preservation is a problem that needs to be addressed, not just at the theoretical level, but also the practical level, and a reasonably good sense of the technical approaches needed to address it. When you get beyond our profession broadly, things are spottier. As someone said to me recently, “Data is not the plural of anecdote.”i The anecdotal view is that faculty and a broad population of thoughtful people in the consumer market are getting more clued into the fragility of digital content and key issues about continuity and preservation. The growth in personal music and photo libraries, and in the use of digital documents in all settings has led to an increase in awareness in the consumer market.

The place where there’s still an issue is to make the claim for resources. We need to get the top administrative level of universities to recognize that this is a resource problem. But university leaders have a grocery list of issues a mile long that call for resources. It takes a lot of work to make it not just an important issue, but to move it high up on the list. That needs to be balanced in the higher education setting with the reality that many libraries, because of overall financial stress, are hoping to find new money from the institution for digital preservation. It’s a tough case to make. The higher university level can come back to libraries saying, “Isn’t this one of your core functions? Why do you need new money for it? Shouldn’t this have been a core function all along?” These are painful but real ironies. Research libraries may have to support a good deal of the work they need to do in digital preservation, and more broadly the stewardship of digital scholarly resources, with very limited new funding, at least in the near term. It’s going to be hard.

Our second Cliff quote also appeared in that same issue of the ARL Bi-Monthly Report, where you are quoted as saying: “Stewardship is easy and inexpensive to claim; it is expensive and difficult to honor, and perhaps it will prove to be all too easy to later abdicate.”

The caution expressed in that quote still haunts me, particularly as I talk to more institutions about digital preservation and about institutional repositories as an approach to stewardship. As recently as 18 months ago, there were relatively few institutions active on the repository front – only a handful. Now, repositories have become terribly fashionable. Everyone wants to claim to be building one. It’s easy to download the MIT DSpace software, put it on a machine, and load in a few terabytes of content. That misses the point. An institutional repository needs to be a service with continuity behind it. The more successful institutions are in getting faculty to count on the repository services, the more frightening it gets. Institutions need to recognize that they are making commitments for the long term.

Implementing repository services involves changing how faculty, and the students who will become the next generation of faculty, think about stewardship of scholarly information and the responsibilities of individuals, institutions, scholarly societies, and research funding agencies in this process. I think we are actually going to see a sort of cross-hatch of disciplinary and institutional repositories come into play; collectively these will realize a part of the so-called “cyber-infrastructure” that’s going to support e-science and e-scholarship. The process of implementing, or deciding whether to implement, an institutional repository is an opportunity for the campus community to engage with and reflect on these issues.

You've no doubt seen the NSF/LC and NSF-EU reports on setting digital preservation research agendas. What do you think about the status and direction of digital preservation research?

First, it’s wonderful these agencies flag it as a special research area. This gives it visibility and priority. It’s been too long buried as grocery list item 42 and as sub-items for other calls for action and broader research agendas. You know how this goes: you prepare the “real” research agenda for whatever you are talking about, and then you put this extra bullet at the end saying of course we also need research into digital preservation, and into the social, economic, privacy, and other policy issues of whatever technology you were really talking about. Elevating digital preservation to this level acknowledges that it is a valuable and significant issue in its own right.

On the specifics, I’m not a disinterested party. I was part of the planning committee for the NSF/LC workshop and I attended a portion of the workshop. The report conveys my views, not as much as if I’d been the chief author, but they’re there. I’ve argued that we should take a pretty broad view – not just of the high-end stuff (e.g., video games from emulators), but all the way through a range of organizational settings and on to how preservation affects the average person on the street. I’ve argued that we want to look at preservation not just of cultural artifacts, but also of organizational records that need help. This is a tremendous problem that’s grown. It’s largely understood (though far from solved) at the national government level. In the United States, NARA’s done a good job of focusing on this and has some substantial resources to bring to bear. When I look at the state and local government level or at corporations large and small (not just the Fortune 500), this is just a disaster area. We need a research agenda to give guidance and assistance to those who manage these valuable resources; we need to offer them affordable and implementable approaches and supporting systems. This is a place not just for analytical research, but also for descriptive research. The scale and severity of the problem are poorly quantified. We need a time series of the scale of the problem from different aspects.

Note that preservation is a nasty place to do research. If organizations made use of breakthroughs, they could only demonstrate their success by coming back after a hundred years to check – that’s not an attractive timeframe for research.

You have pointed to the distinction between digital preservation and archiving as the basis for confusion, especially between libraries and archives. Would you elaborate on that theme?

This gets at professional bias and perspective: how archivists are trained to approach problems versus how librarians are trained to approach problems. As someone trained as neither, I am in a position of profound lack of knowledge – an ideal place from which to pontificate about this and offend everybody, particularly by overbroad generalizations. So let me try to put it this way. Librarians think in terms of a collection that’s a bunch of objects: a bunch of books, each of which is deteriorating. The goal is to keep the books or the content of the books alive into the future. Archivists look more at aggregate activities and processes that document institutional decision-making. Their concern is less with individual transactions or objects and more with whether we have captured the aggregation of documentation and context around processes: not isolated documents, but sets of evidence that provide insight into decisions and actions. Particularly in the digital world, parts of long-term archival context are hidden in nooks and crannies of technical infrastructure: directories, PKI systems, etc.

You have been using the term "deep time preservation." Would you talk a little about that?

I don’t want to take credit for the term itself. This references, for example, the book Deep Time by Gregory Benford.ii Deep time refers to such things as building a nuclear waste site for 10,000 years; what kind of signage could be applied that would last as long as the toxins?iii These are exercises about preserving across vast reaches of time. Appropriately, Benford is both a science fiction writer and a physicist. Another reference on this kind of thinking is Stewart Brand’s book, The Clock of the Long Now.iv I’ve used the term “deep time” a bit less ambitiously than they have, as a sort of shorthand for a particular class of questions that people worried about digital preservation like to frame and debate. They ask: “How do we know if this digital thing will be presentable in a meaningful way to readers in 250 years?” Great question, but balance it with another class of questions that are equally vital, but are rather more boring, run-of-the-mill information technology management and security issues: “How do we know the bits that comprise this digital thing will survive uncorrupted for the next two weeks? The next three years?”

When we talk about digital preservation as a serious operational activity, we need to consider both of these extremes; if we focus only on one we will fail.
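The short-horizon question, whether the bits survive uncorrupted for the next two weeks, is commonly handled with routine fixity checks. The sketch below is only an illustration and is not taken from the interview: the function names and sample bytes are hypothetical, and SHA-256 is an assumed choice of digest. It records a digest when an object is ingested and recomputes it later to detect silent corruption.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 digest of a byte string as hex."""
    return hashlib.sha256(data).hexdigest()


def verify_fixity(data: bytes, recorded_digest: str) -> bool:
    """Return True if the bytes still match the digest recorded at ingest."""
    return sha256_of(data) == recorded_digest


# Hypothetical object: record its digest at ingest time.
original = b"digital object bytes"
recorded = sha256_of(original)

# Later audit: one copy is intact, another has a single altered byte.
corrupted = b"digital object bytez"
intact_ok = verify_fixity(original, recorded)
corrupt_ok = verify_fixity(corrupted, recorded)
```

A check like this says nothing about whether the object will still be renderable in 250 years; it addresses only the near-term end of the range described above.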

In a 1999 D-Lib article, you advocated canonicalization as a means to “facilitate preservation and management of digital information.” Can you provide a simplified explanation of the theory behind this digital preservation strategy and describe the current status of tool development in support of canonicalization for different kinds of digital objects?

It’s not really a strategy so much as it is a tool or technique to measure and test how well your strategy is working. It’s a way of codifying a representation or rendering of an object in a standard way so that you can see how moving from storing the object in one format to storing it in another format changes the way the object looks or acts.

Again, I don’t claim to have originated the term. As I explain in the paper, I heard it first in the context of the World Wide Web Consortium (W3C) XML digital signature work.v I may have been the first to raise canonicalization in preservation as helpful for being specific about the aspects of digital objects you want to take forward through transformations – a computable and specific way to see how well a migration strategy is working on a corpus of materials. I’m not sure how much use the approach has seen in the field. I would imagine that it would be very helpful in exploring a migration from, say, JPEG to JPEG 2000. I’ve seen a few emails in the last few months from people working in the area of word processing documents. They are looking at canonical forms of documents in the emerging world of new generations of word processors and the impact of importing/exporting to XML. This is a fruitful time to look at these things, I suspect.
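As a rough illustration of the idea, and not a reconstruction of any tool mentioned here, the sketch below uses XML canonicalization from Python's standard library (`xml.etree.ElementTree.canonicalize`, available in Python 3.8 and later) as the canonical form. The sample documents and the `canonical_digest` helper are hypothetical. Two serializations that differ only in attribute order and quoting reduce to the same canonical form, while a change in content does not.

```python
import hashlib
import xml.etree.ElementTree as ET


def canonical_digest(xml_text: str) -> str:
    """Hash the canonical (C14N 2.0) form of an XML document."""
    canonical = ET.canonicalize(xml_text)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical object before migration.
before = '<doc b="2" a="1"><p>Hello</p></doc>'

# After a migration that only changed attribute order and quote style.
after_ok = "<doc a='1' b='2'><p>Hello</p></doc>"

# After a migration that altered the text content.
after_bad = "<doc a='1' b='2'><p>Hullo</p></doc>"

survived = canonical_digest(before) == canonical_digest(after_ok)
damaged = canonical_digest(before) != canonical_digest(after_bad)
```

For other object types the canonical form would have to be defined differently. For an image migration, one assumption would be to compare decoded pixel data, not file bytes, and to accept small differences where the target format is lossy.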

In "The Afterlives of Courses on the Network," you made a reasoned pitch about the importance of talking about the preservation of course management systems. Are you aware of anyone who has actually started preserving these things? Do you have any subsequent thoughts about the need to preserve every streamed lecture and every student assignment from every course because it is in electronic form?

I want to be clear here that “Afterlives” was really more about information policy issues than simply about preservation – it addressed the broader questions about when and for how long courses should be retained, and who should have access to them when they are retained. For some courses, or some parts of courses, fairly short-term retention may be the right answer. It’s very hard for me to believe that we need to save everybody’s Calculus I homework problems, or their essay on Hamlet from high school, in perpetuity. A few projects are looking at archiving courseware. For example, MIT is addressing dissemination issues in its OpenCourseWare effort but is also looking at long-term retention of material in OpenCourseWare. There are a number of projects on a smaller scale. Most address material that faculty members produce – they are videoing lectures and keeping lecture notes, all of the major components. We need to look at many aspects of student-authored material and the issues around that. The essay you mentioned was helpful in triggering policy discussions about these issues, but it has not yet crystallized into actual policy at any institution I know of. I have been told of several universities that have active policy development processes going in this area, but it’s going to take a while. That’s not surprising. It’s a complicated area and it both demands and benefits from thorough, deliberative, thoughtful consideration that brings in all interested parties within the campus community.

Weblogs have become ubiquitous, and yet their fragility is well documented. Some people would say they’re ephemeral and not to bother with them. Others would say they contain invaluable nuggets for current and future research. What is your view on the implications of blogging for cultural heritage collections and practice? How important are these materials for the historical record? What can be done about them?

Blogs are important in the same way that websites are. Blogs really do range from the sublime to the ridiculous. They can be wonderful sources of thinking, analysis, pointers of interest. There is a growing fuss over the extent to which the blog is an essential form of legitimate journalism or scholarly communication. Blogs succeed or fail on the quality of the content. They can be a significant form of communication. So I think that answers part of the question: there are undoubtedly a goodly number of blogs that deserve preservation, just as there are a goodly number of websites. Every website does not need to be preserved in every one of its versions, nor does every blog.

Let me speak to the other part of the question. Certainly, many blogs are rather personal affairs, one-person shows, that are maintained as long as the author is interested, and in that sense may be at least a bit more ephemeral than the “average” website – though I’ve not seen real data on this. In another sense, however, blogs are much easier to handle than typical websites. Most blogs that I have seen are in essence “grow-only” with older material rolled off to an archive perhaps. On Web sites, when you update a page you replace it. So it’s much more likely that when you “harvest” or “crawl” a blog you are going to get its entire history up to the point you are visiting it. For most Web sites, you have to visit over and over again to get that history, which is represented as a series of versions. So I’m not sure that technically blogs are worse than websites broadly.

We’re going to say a few terms, and we’d like you to tell us what comes to your mind:

Trusted Digital Repository

I associate this with institutional repositories and preservation services, as well as with certain disciplinary archives.

Deep Webvi

The deep Web (sometimes the term “invisible Web” is used) describes all of the datasets and databases that are only observable through dynamically generated Web pages. This material represents the preservation problem from hell. We basically aren’t dealing with this, and we don’t know how to do so very effectively except on a case-by-case basis and often with the explicit cooperation and collaboration of the owner/manager of a specific chunk of the deep Web. Note that this is not just documents that are stored in databases – things like the US SEC EDGAR database or the US GPO stuff. It’s all the databases of train and plane schedules, stock quotes, product information, rosters, etc. I would assign this a very high priority for research, and I would urge that we look at a range of “scenarios” that include different levels of work by the resource owner as well as the people doing the archiving.

Note that there’s another domain that is related to, but not conterminous with, the deep Web; these are the parts of the Web, deep or surface, that are protected by registration, by robot exclusion indicators, by authentication and access management systems, etc. We aren’t archiving this either, by and large.

OAIS

The Open Archival Information System (OAIS) model is a high-level reference and terminology model. I worry that as valuable as this model is, it has been turned into a sacred cow. Every proposal makes a bow to OAIS and conformance. We aren’t seeing much critical analysis of where it is and isn’t helpful. It is, in my view, more valuable as an analytical tool than as a blueprint for systems.

OAI

OAI unfortunately is used for two things: to talk about the “Open Archives Initiative”, a sort of broad movement that’s related to open access issues and that deals with policy, economics, and business models; and to refer to a much more tightly focused set of technical efforts – most notably the OAI-PMH (Protocol for Metadata Harvesting) that CNI and the Digital Library Federation helped to underwrite and advance. The latter is in my view critically important “plumbing” infrastructure for moving metadata around in distributed environments; what people often don’t understand is that OAI-PMH is actually agnostic about business models and has an extraordinarily wide range of applications. There’s some absolutely wonderful work going on now to build an Apache Web server module that would use OAI-PMH as a way of much more effectively and efficiently crawling Web sites, and could also be extended to provide easy access to deep Web materials behind the site. See my piece of a couple of years ago in the ARL bulletin for more info on this.
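To give a sense of how lightweight this “plumbing” is, the sketch below builds an OAI-PMH `ListRecords` request and reads record identifiers and the resumption token out of a response. It is a minimal illustration under stated assumptions: the repository URL, the sample response, and the helper names are hypothetical, and no network request is made. The verb, the `metadataPrefix`, `from`, and `resumptionToken` arguments, and the XML namespace are those defined by OAI-PMH 2.0.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="oai_dc",
                     from_date=None, resumption_token=None):
    """Build a ListRecords request URL for a repository's base URL."""
    if resumption_token:
        # A resumption token is sent alone, without the other arguments.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date:
            params["from"] = from_date
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (record identifiers, resumption token or None)."""
    root = ET.fromstring(xml_text)
    ids = [e.text for e in root.findall(".//oai:header/oai:identifier", OAI_NS)]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    token = token_el.text if token_el is not None and token_el.text else None
    return ids, token


# Hypothetical repository endpoint and a hand-written sample response.
url = list_records_url("https://repository.example.org/oai",
                       from_date="2004-01-01")

sample_response = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:repository.example.org:1234</identifier>
        <datestamp>2004-06-01</datestamp>
      </header>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

identifiers, next_token = parse_list_records(sample_response)
```

A harvester would repeat the request with each returned resumption token until none is supplied. Nothing in the exchange depends on who runs the repository or how access is paid for, which is the sense in which the protocol is agnostic about business models.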

Fedora

A digital object-based repository that Sandy Payette and Carl Lagoze have been developing for some years. Now they’re working with UVA. It’s a very powerful system, but a complicated system. It has a lot going for it technically. It still needs work both in terms of exposition and in terms of applications packaging that will open it up for a broader user base, and I know that they have been making recent efforts in this area. I am also hopeful that some of the recently announced commercial efforts based on Fedora may be helpful in this area.

Thank you, Cliff, for taking the time to address these issues. Is there anything you wish to add?

We could certainly go on, but we’ve covered a great deal of ground; let’s call it a day.


i Rob Rosenberger article quoting Mike Quear, “technocrat in the U.S. House of Representatives” quoting Frank Kotsonis

ii See an abstract at: http://www.harpercollins.com/catalog/book_xml.asp?isbn=0380975378.

iii See http://www.well.com/user/sbb/.

iv See for example: http://www.w3.org/TR/xml-exc-c14n/.

v See http://www.educause.edu/ecar/research/research.asp.

vi See information at http://www.brightplanet.com/deepcontent/deep_Web_faq.asp or http://www.wordiq.com/definition/Invisible_Web for definitions; and sources such as http://www.nla.gov.au/ntwkpubs/gw/66/html/p15a01.html, and http://www.jisc.ac.uk/index.cfm?name=project_webarchiving on preservation issues for the deep Web.

vii See the June 21, 2004 Federal Computer Week article at: http://www.fcw.com/fcw/articles/2004/0621/pol-crisis-06-21-04.asp


Copyright 2004 RLG.