Contents of: Volume 8, Number 4, ISSN 1093-5371
  Feature Article 1: Editor's Interview with Clifford A. Lynch  
  Feature Article 2: Applying 3-Dimensional Modeling Tools to Analysis of Handwritten Manuscripts  
  Highlighted Web Site: The Ten Thousand Year Blog  
  FAQ: Blog Today, Gone Tomorrow? Preservation of Weblogs  
  Calendar of Events  
  Announcements  
  Publishing Information  
Feature Article 1

Editor's Interview with Clifford A. Lynch



Clifford A. Lynch clifford@cni.org
Executive Director, Coalition for Networked Information (CNI)

CNI has long had an interest in digitization and digital preservation. How has that changed in the last seven years during your tenure as Executive Director? How do digitization and digital preservation factor into CNI’s agenda?

Basically, the issue of digitization is a pretty fundamental one, to the extent that one is concerned about material to support scholarship. Huge amounts of material that are critical to teaching, learning, and research remain in print, on paper or in other physical formats that are largely inaccessible. We have certainly seen a lot of institutions doing work in this area. At Cornell, to cite just one example that I know you are familiar with, there has been lots of work on digitizing collections, producing digital content that can be exploited to advance scholarship and learning. Since the beginning, though, most digitization has been motivated at least in very large part by desires to improve access. There has been a debate about whether digitization constitutes an actual preservation strategy, as opposed to simply a way to produce a surrogate that can reduce handling of fragile originals. I believe that yes, digitization is an actual direct preservation strategy that is every bit as valid and sensible as, for example, microfilming, but this has been controversial and the preservation world is just getting comfortable with this. A recent ARL policy statement supporting digitization as a legitimate reformatting strategy is, I think, a milestone in acceptance. Digitization enables easy replication of preservation surrogates or copies in multiple locations, making them more survivable than print or film versions.

Digital preservation – by which I will mean a focus on preserving digital objects, as opposed to digitization to preserve originally physical objects – is another fundamental CNI interest for as long as I’ve been associated with it. As discourse goes digital, we’d better figure out how to ensure a lasting record or we run the risk of losing the continuity of the discourse. It’s as basic as that, and it’s significant for all aspects and spheres of discourse: personal, commercial, scholarly, public.

As for the CNI agenda and priorities for the next few years, I think that the overarching objective has to be putting in place effective large-scale systems to actually carry out digital preservation activities. This means attention to social, economic, legal, and organizational as well as technical aspects of the digital preservation problem. Over the years, there has been a lot of focus on a magic bullet technology for digital preservation. Personally, I don’t believe one exists. We’ve seen various proposals on magic bullets, e.g., inscribing information on nickel-based storage that can be read in 10,000 years. It doesn’t get us very far when talking in terms of complex interactions and enormous databases. Emulation isn’t a magic bullet either – though I think it’s a useful tool in the toolbox of digital preservation techniques and technologies.

In the last few years initiatives like the Library of Congress-led NDIIPP have started to focus on economics and ongoing roles and responsibilities issues. These are critical for any real progress – probably the most important parts. We need technology and best practices. What’s been missing is all of our major cultural heritage institutions stepping up to say, “This is our responsibility. Yes, we have limited resources; yes, this is difficult, but we need to do it.”

We’ve had leadership from the Internet Archive in this area since the mid-1990s. The Internet Archive represents a wake-up call and a bit of a rebuke to established cultural heritage institutions. National libraries are now stepping up to provide leadership. We’re seeing progress in European legislation for national libraries in recognizing their role in stewardship of digital materials in cultural heritage, e.g., in Scandinavia and the UK. I’ve already mentioned the LC NDIIPP. And the broader research library community, at least in the US, Canada, and a few other nations, is developing strategies such as institutional repositories for preserving scholarly information – though I think they are going to have to get more involved in the non-scholarly cultural materials that are the raw materials of future scholarship, which is going to be a messy, expensive, and thankless but necessary task; we can’t just leave this to the national libraries.

We have a pair of “quotable Cliff” questions for you. In the Feb 2003 ARL Bimonthly Report you said, “The thinking about digital preservation over the past five years has advanced to the point where the needs are widely recognized and well defined, the technical approaches at least superficially mapped out, and the need for action is now clear.” What is your current thinking about that quote?

For what it’s worth, I still believe it, though in hindsight I was perhaps a bit overly optimistic about how widely the needs for digital preservation are recognized. It’s getting better every month and every year, but there’s still a lot of education to do.

In my job, I talk to a lot of senior administrators in higher education institutions. One of the issues they’re concerned about is developing strategies for deploying institutional repositories, which includes making the case for digital preservation to the academic community.

I think that within libraries, archives, and related institutions there is a pretty good understanding that digital preservation is a problem that needs to be addressed, not just at the theoretical level, but also the practical level, and a reasonably good sense of the technical approaches needed to address it. When you get beyond our profession broadly, things are spottier. As someone said to me recently, “Data is not the plural of anecdote.”i The anecdotal view is that faculty and a broad population of thoughtful people in the consumer market are getting more clued into the fragility of digital content and key issues about continuity and preservation. The growth in personal music and photo libraries, and in the use of digital documents in all settings has led to an increase in awareness in the consumer market.

The place where there’s still an issue is to make the claim for resources. We need to get the top administrative level of universities to recognize that this is a resource problem. But university leaders have a grocery list of issues a mile long that call for resources. It takes a lot of work to make it not just an important issue, but to move it high up on the list. That needs to be balanced in the higher education setting with the reality that many libraries, because of overall financial stress, are hoping to find new money from the institution for digital preservation. It’s a tough case to make. The higher university level can come back to libraries saying, “Isn’t this one of your core functions? Why do you need new money for it? Shouldn’t this have been a core function all along?” These are painful but real ironies. Research libraries may have to support a good deal of the work they need to do in digital preservation, and more broadly the stewardship of digital scholarly resources, with very limited new funding, at least in the near term. It’s going to be hard.

Our second Cliff quote also appeared in that same issue of the ARL Bi-Monthly Report, where you are quoted as saying: “Stewardship is easy and inexpensive to claim; it is expensive and difficult to honor, and perhaps it will prove to be all too easy to later abdicate.”

The caution expressed in that quote still haunts me particularly as I talk to more institutions about digital preservation and about institutional repositories as an approach to stewardship. As little as 18 months ago, there were relatively few institutions on the repository front - only a handful. Now, repositories have become terribly fashionable. Everyone wants to claim to be building one. It’s easy to download the MIT DSpace software, put it on a machine, and load in a few terabytes of content. That misses the point. An institutional repository needs to be a service with continuity behind it. The more successful institutions are in getting faculty to count on the repository services, the more frightening it gets. Institutions need to recognize that they are making commitments for the long term.

Implementing repository services involves changing how faculty and students who will become the next generation of faculty think about stewardship of scholarly information and the responsibilities of individuals, institutions, scholarly societies, and research funding agencies in this process. I think we are actually going to see a sort of cross-hatch of disciplinary and institutional repositories come into play; collectively these will realize a part of the so-called “cyber-infrastructure” that’s going to support e-science and e-scholarship. The process of implementing, or deciding whether to implement, an institutional repository is an opportunity for the campus community to engage and reflect on these issues.

You've no doubt seen the NSF/LC and NSF-EU reports on setting digital preservation research agendas. What do you think about the status and direction of digital preservation research?

First, it’s wonderful these agencies flag it as a special research area. This gives it visibility and priority. It’s been too long buried as grocery list item 42 and as sub-items for other calls for action and broader research agendas. You know how this goes: you prepare the “real” research agenda for whatever you are talking about, and then you put this extra bullet at the end saying of course we also need research into digital preservation, and into the social, economic, privacy, and other policy issues of whatever technology you were really talking about. Elevating digital preservation to this level acknowledges that it is a valuable and significant issue in its own right.

On the specifics, I’m not a disinterested party. I was part of the planning committee for the NSF/LC workshop and I attended a portion of the workshop. The report conveys my views, not as much as if I’d been the chief author, but they’re there. I’ve argued that we should take a pretty broad view – not just of the high-end stuff (e.g., video games from emulators), but all the way through a range of organizational settings and on to how preservation affects the average person on the street. I’ve argued that we want to look at preservation not just of cultural artifacts, but also of organizational records that need help. This is a tremendous problem that’s grown. It’s largely understood (though far from solved) on the national government level. In the United States, NARA’s done a good job at focusing on this and has some substantial resources to bring to bear. When I look at the state and local government level or at corporations large and small (not just Fortune 500), this is just a disaster area. We need a research agenda to give guidance and assistance for those who manage these valuable resources; we need to offer them affordable and implementable approaches and supporting systems. This is a place not just for analytical research, but also descriptive research. The scale and severity of the problem are poorly quantified. We need a time series of the scale of the problem from different aspects.

Note that preservation is a nasty place to do research. If organizations made use of breakthroughs, they could only demonstrate their success by coming back after a hundred years to check – that’s not an attractive timeframe for research.

You have pointed to the distinction between digital preservation and archiving as the basis for confusion, especially between libraries and archives. Would you elaborate on that theme?

This gets at professional bias and perspective: how archivists are trained to approach problems versus how librarians are trained to approach problems. As someone trained as neither, I am in a position of profound lack of knowledge – an ideal place from which to pontificate about this and offend everybody, particularly by overbroad generalizations. So let me try to put it this way. Librarians think in terms of a collection that’s a bunch of objects: a bunch of books, each of which is deteriorating. The goal is to keep the books or the content of the books alive into the future. Archivists look more at aggregate activities and processes that document institutional decision-making. Their concern is less with individual transactions or objects and more with looking at whether we have captured the aggregation of documentation and context around processes: not isolated documents, but sets of evidence that provide insight into decisions and actions. Particularly in the digital world, parts of long-term archival context are hidden in nooks and crannies of technical infrastructure: directories, PKI systems, etc.

You have been using the term "deep time preservation." Would you talk a little about that?

I don’t want to take credit for the term itself. This references, for example, the book Deep Time by Gregory Benford.ii Deep time refers to such things as building a nuclear waste site for 10,000 years; what kind of signage could be applied that would last as long as the toxins?iii These are exercises about preserving across vast reaches of time. Appropriately, Benford is both a science fiction writer and a physicist. Another reference on this kind of thinking is Stewart Brand’s book, The Clock of the Long Now.iv I’ve used the term “deep time” a bit less ambitiously than they, as a sort of shorthand for a particular class of questions people worried about digital preservation like to frame and debate. They ask: “How do we know if this digital thing will be presentable in a meaningful way to readers in 250 years?” Great question, but balance it with another class of questions that are equally vital, but are rather more boring, run-of-the-mill information technology management and security issues: “How do we know the bits that comprise this digital thing will survive uncorrupted for the next two weeks? The next three years?”

When we talk about digital preservation as a serious operational activity, we need to consider both of these extremes; if we focus only on one we will fail.
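The short-horizon question, whether the bits survive uncorrupted for the next two weeks, is commonly approached with fixity checks: record a cryptographic digest when an object is stored and recompute it at each audit. The sketch below is illustrative only; the function names, the sample content, and the choice of SHA-256 are assumptions, not something described in the interview.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def fixity_ok(data: bytes, recorded_digest: str) -> bool:
    """True if the bytes still match the digest recorded at ingest."""
    return digest(data) == recorded_digest


# Record a digest when the object is first stored (hypothetical content).
original = b"digital object contents"
recorded = digest(original)

# Later audits recompute the digest; a single flipped bit changes it.
corrupted = bytes([original[0] ^ 0x01]) + original[1:]
```

A check like this says nothing about whether the object will still be renderable in 250 years, which is why both classes of question need attention.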

In a 1999 D-Lib article, you advocated canonicalization as a means to “facilitate preservation and management of digital information.” Can you provide a simplified explanation of the theory behind this digital preservation strategy and describe the current status of tool development in support of canonicalization for different kinds of digital objects?

It’s not really a strategy so much as it is a tool or technique to measure and test how well your strategy is working. It’s a way of codifying a representation or rendering of an object in a standard way so that you can see how moving from storing the object in one format to storing it in another format changes the way the object looks or acts.

Again, I don’t claim to have originated the term. As I explain in the paper, I heard it first in the context of the World Wide Web Consortium XML digital signature work.v I may have been the first to raise canonicalization in preservation as helpful for being specific about aspects of digital objects you want to take forward through transformations – a computable and specific way to see how well a migration strategy is working on a corpus of materials. I’m not sure how much use the approach has seen in the field. I would imagine that it would be very helpful in exploring a migration from, say, JPEG to JPEG2000. I’ve seen a few emails in the last few months from people working in the area of word processing documents. They are looking at canonical forms of documents in the emerging world of new generations of word processors and the impact of importing/exporting to XML. This is a fruitful time to look at these things, I suspect.
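As a rough illustration of the idea, the sketch below reduces a text object to a canonical form before and after a hypothetical format migration and compares digests of the canonical forms rather than of the stored bytes. The canonical form chosen here (decoded text, Unicode NFC normalization, collapsed whitespace) and all function names are assumptions made for the example; a real canonicalization would be defined per object type and per the properties one wants to carry forward.

```python
import hashlib
import unicodedata


def canonical_text(raw: bytes, encoding: str) -> str:
    """Reduce stored text to a hypothetical canonical form:
    decoded, NFC-normalized, with whitespace runs collapsed."""
    text = raw.decode(encoding)
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())


def canonical_digest(raw: bytes, encoding: str) -> str:
    """Digest of the canonical form, not of the stored bytes."""
    canonical = canonical_text(raw, encoding)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# The same content before and after a hypothetical migration:
# Latin-1 with CRLF line endings -> UTF-8 with LF and a decomposed accent.
before = "Caf\u00e9 menu\r\nprix fixe".encode("latin-1")
after = "Cafe\u0301 menu\nprix fixe".encode("utf-8")

same_bytes = before == after
same_canonical = (canonical_digest(before, "latin-1")
                  == canonical_digest(after, "utf-8"))
```

Here the stored bytes differ but the canonical digests agree, which is the computable signal that the migration preserved the properties the canonical form captures.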

In "The Afterlives of Courses on the Network," you made a reasoned pitch about the importance of talking about the preservation of course management systems. Are you aware of anyone who has actually started preserving these things? Do you have any subsequent thoughts about the need to preserve every streamed lecture and every student assignment from every course because it is in electronic form?

I want to be clear here that “Afterlives” was really more about information policy issues than simply about preservation – it addressed the broader questions about when and for how long courses should be retained, and who should have access to them when they are retained. For some courses, or some parts of courses, fairly short term retention may be the right answer. It’s very hard for me to believe that we need to save everybody’s Calculus I homework problems, or their essay on Hamlet from high school, in perpetuity. A few projects are looking at archiving courseware. For example, MIT is addressing dissemination issues in their Open Courseware effort but also looking at long-term retention of material in Open Courseware. There are a number of projects on a smaller scale. Most address material faculty members produce – they are videoing lectures, keeping lecture notes, all of the major components. We need to look at many aspects of student-authored material and the issues around that. The essay you mentioned was helpful in triggering policy discussions about issues, but it has not yet crystallized into actual policy at any institution I know of. I have been told of several universities that have got active policy development processes going around this area, but it’s going to take a while. That’s not surprising. It’s a complicated area and it both demands and benefits from a thorough, deliberative, thoughtful consideration that brings in all interested parties within the campus community.

Weblogs have become ubiquitous, and yet their fragility is well documented. Some people would say they’re ephemeral and not to bother with them. Others would say they contain invaluable nuggets for current and future research. What is your view on the implications of blogging for cultural heritage collections and practice? How important are these materials for the historical record? What can be done about them?

Blogs are important in the same way that websites are. Blogs really do range from the sublime to the ridiculous. They can be wonderful sources of thinking, analysis, pointers of interest. There is a growing fuss over the extent to which the blog is an essential form of legitimate journalism or scholarly communication. Blogs succeed or fail on the quality of the content. They can be a significant form of communication. So I think that answers part of the question: there are undoubtedly a goodly number of blogs that deserve preservation, just as there are a goodly number of websites. Every website does not need to be preserved in every one of its versions, nor does every blog.

Let me speak to the other part of the question. Certainly, many blogs are rather personal affairs, one-person shows, that are maintained as long as the author is interested, and in that sense may be at least a bit more ephemeral than the “average” website – though I’ve not seen real data on this. In another sense, however, blogs are much easier to handle than typical websites. Most blogs that I have seen are in essence “grow-only” with older material rolled off to an archive perhaps. On Web sites, when you update a page you replace it. So it’s much more likely that when you “harvest” or “crawl” a blog you are going to get its entire history up to the point you are visiting it. For most Web sites, you have to visit over and over again to get that history, which is represented as a series of versions. So I’m not sure that technically blogs are worse than websites broadly.

We’re going to say a few terms, and we’d like you to tell us what comes to your mind:

Trusted Digital Repository

I associate this with institutional repositories and preservation services, as well as with certain disciplinary archives.

Deep Webvi

The deep Web (sometimes the term “invisible Web” is used) describes all of the datasets and databases that are only observable through dynamically generated Web pages. This material represents the preservation problem from hell. We basically aren’t dealing with this, and we don’t know how to do so very effectively except on a case-by-case basis and often with the explicit cooperation and collaboration of the owner/manager of a specific chunk of dark Web. Note that this is not just documents that are stored in databases – things like the US SEC Edgar database or the US GPO stuff. It’s all the databases of train and plane schedules, stock quotes, product information, rosters, etc. etc. I would assign this a very high priority for research, and I would urge that we look at a range of “scenarios” that include different levels of work by the resource owner as well as the people doing the archiving.

Note that there’s another domain that is related to, but not conterminous with, the deep Web; these are the parts of the Web, deep or surface, that are protected by registration, by robot exclusion indicators, by authentication and access management systems, etc. We aren’t archiving this either, by and large.

OAIS

The Open Archival Information System model is a high-level reference and terminology model. I worry that as valuable as this model is, it has been turned into a sacred cow. Every proposal makes a bow to OAIS and conformance. We aren’t seeing much critical analysis of where it is and isn’t helpful. It is, in my view, more valuable as an analytical tool than as a blueprint for systems.

OAI

OAI unfortunately is used for two things: to talk about the “Open Archives Initiative”, a sort of broad movement that’s related to open access issues and that deals with policy, economics, and business models, and to refer to a much more tightly focused set of technical efforts – most notably the OAI-PMH (Protocol for Metadata Harvesting) that CNI and the Digital Library Federation helped to underwrite and advance. The latter is in my view critically important “plumbing” infrastructure for moving metadata around in distributed environments; what people often don’t understand is that OAI-PMH is actually agnostic about business models and has an extraordinarily wide range of applications. There’s some absolutely wonderful work going on now to build an Apache Web server module that would use OAI-PMH as a way of much more effectively and efficiently crawling Web sites, and could also be extended to provide access to deep Web materials behind the site easily. See my piece of a couple years ago in ARL bulletin for more info on this.
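To make the “plumbing” concrete: an OAI-PMH harvest is an ordinary HTTP request whose query string names a protocol verb and a metadata format. The sketch below only builds such a request URL; the repository address is a made-up placeholder and the helper name is hypothetical, while `ListRecords`, `metadataPrefix`, `oai_dc`, and `from` are names defined by the protocol.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc",
                     from_date: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        # Selective harvesting: only records changed since this datestamp.
        params["from"] = from_date
    return base_url + "?" + urlencode(params)


# Hypothetical repository endpoint, used only for illustration.
url = list_records_url("https://repository.example.org/oai",
                       from_date="2004-01-01")
```

Nothing in the request depends on who runs the repository or on what terms the content is offered, which is the sense in which the protocol is agnostic about business models.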

Fedora

A digital object-based repository that Sandy Payette and Carl Lagoze have been developing for some years. Now they’re working with UVA. It’s a very powerful system, but a complicated system. It has a lot going for it technically. It still needs work both in terms of exposition and in terms of applications packaging that will open it up for a broader user base, and I know that they have been making recent efforts in this area. I am also hopeful that some of the recently announced commercial efforts based on Fedora may be helpful in this area.

Thank you, Cliff, for taking the time to address these issues. Is there anything you wish to add?

We could certainly go on, but we’ve covered a great deal of ground; let’s call it a day.


i Rob Rosenberger article quoting Mike Quear, “technocrat in the U.S. House of Representatives” quoting Frank Kotsonis

ii See an abstract at: http://www.harpercollins.com/catalog/book_xml.asp?isbn=0380975378.

iii See http://www.well.com/user/sbb/.

iv See for example: http://www.w3.org/TR/xml-exc-c14n/.

v See http://www.educause.edu/ecar/research/research.asp.

vi See information at http://www.brightplanet.com/deepcontent/deep_Web_faq.asp or http://www.wordiq.com/definition/Invisible_Web for definitions; and sources such as http://www.nla.gov.au/ntwkpubs/gw/66/html/p15a01.html, and http://www.jisc.ac.uk/index.cfm?name=project_webarchiving on preservation issues for the deep Web.

vii See the June 21, 2004 Federal Computer Week article at: http://www.fcw.com/fcw/articles/2004/0621/pol-crisis-06-21-04.asp


Feature Article 2

Applying 3-Dimensional Modeling Tools to Analysis of Handwritten Manuscripts

Authors: Jeremy Rowe - Institute for Computing & Information Sciences and Engineering (InCISE)/Partnership for Research in Spatial Modeling (PRISM), ASU (jeremy.rowe@asu.edu), Anshuman Razdan - PRISM (razdan@asu.edu), John Femiani - PRISM (john.femiani@asu.edu)

Introduction

Online access to original historic resources has revolutionized historical research. Primary source materials such as manuscripts are increasingly available, easily searched via keywords or phrases, printed, and excerpted in ways only imaginable to researchers a decade ago. Historical photographs can be located by keyword within collections and even across collections and from general Web resources using the tools now available in search engines such as Google. Once in digital form, the ability to search and interact with the content permits powerful analysis and comparison that is a key foundation for scholarship. In some ways, historical research has never seen greater resources or flexibility.

Unfortunately, the online resources available today are only a small fraction of the historic record that could be available. Scanning photographs, and scanning and transcribing manuscripts are labor intensive and time consuming. The resources needed to build well-catalogued and comprehensively described collections of images are rarely available. Priority is often placed on high-visibility documents, images, or collections since large scale digitizing in most cases is all but impossible.

Providing access to “born digital” textual material is relatively straightforward, at least from a technical perspective. Digital resources created from digital images of machine-printed documents rely on Optical Character Recognition (OCR) to produce searchable versions of the printed pages. The accuracy of OCR from clean printed documents is approaching 100%, although even small errors at the character level can be compounded as they are combined into meaningful words and phrases. Even so, many results are sufficiently accurate for indexing; others require a third step of human correction.
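The compounding effect can be illustrated with a little arithmetic. The sketch below assumes, as a simplification not made in the article, that character errors are independent, so the chance that a whole word is recognized correctly is the character accuracy raised to the word length. The function name is hypothetical.

```python
def word_accuracy(char_accuracy: float, word_length: int) -> float:
    """Probability that every character in a word is recognized correctly,
    assuming independent character errors (a simplification)."""
    return char_accuracy ** word_length


# Character accuracy of 99% applied to words of 1, 5, and 10 letters.
rates = {n: word_accuracy(0.99, n) for n in (1, 5, 10)}
```

Under that assumption, 99% character accuracy leaves roughly 95% of five-letter words and about 90% of ten-letter words fully correct, which is why near-perfect character rates can still produce noticeable gaps in a keyword index.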

One of the greatest challenges lies in providing access to handwritten materials, particularly historical manuscripts. The FAQ in RLG DigiNews Volume 8, Number 1 provided an excellent overview of handwriting recognition techniques. This paper will provide some additional background and present another tool to assist in human transcription of complex, highly variable handwritten manuscripts.

Figure 1. Example of the cursive writing with varying letterforms and overlapping characters in a Spanish manuscript.

The highly variable characters and writing style of many historic manuscripts raise significant challenges to human transcription and translation, let alone attempts to automate these processes. In addition, these manuscripts frequently include complex embedded annotations that must also be interpreted. Examples of the complexity of written communication within a handwritten manuscript include:

  1. Cursive writing style - often lacking or with highly variable separation between characters.
  2. Variation of letterforms - unlike typewritten or printed text, each handwritten character varies slightly in shape and size from similar characters elsewhere in the document, even when created by the same individual. For example, the bottom loop of “y” may run into the line below it.
  3. Different writing styles - manuscripts can be written or appended by many different people and include many different styles of cursive writing.
  4. Embedded information - sprinkled throughout the manuscripts are abbreviations, rubrics, diacritical marks, signatures, and other coded communications that must be recognized and associated with meaning more complex than a single letterform.
  5. Notations and omitted characters – insertion of additional information, missing characters, etc. within or between the rows of characters.
  6. Effects of aging such as paper discoloration, foxing, stains, ink transfer and bleed-through from adjacent pages, and many other sources further complicate image based character recognition.

Figure 2. Example of a rubric embedded in a historic Spanish manuscript.

Current Methods

Though most digital library projects involving handwritten manuscripts work with images and human translations of historic materials, several projects have attempted to digitize and automatically index or assist human translation of handwritten manuscript materials using optical character recognition (OCR) techniques including:

The Word Spotting Project, University of Massachusetts, Amherst

Offline Isolated Handwritten Thai OCR Using Island-Based Projection with N-Gram Models and Hidden Markov Models. T. Theeramunkong, Thammasat University

Character Cluster based Thai Information Retrieval

Classical Chinese Digital Database and Interactive Internet Worktable. S. Wenyuan, University of Hawaii at Manoa

The general process of character recognition of handwritten manuscripts is based on two-dimensional image recognition or stochastic methods, such as neural networks or Hidden Markov Models, applied to the two-dimensional image. In these applications, the manuscript (page, letters, words, etc.) is represented as a two-dimensional bitmap. Various methods then try to infer the letterform or character from this image.

The key to accurate recognition is the ability to differentiate between the ascenders, descenders, loops, curls, and endpoints that define the overall letterforms. The text recognition problem is compounded when letterforms overlap or merge with adjacent characters. Identification of contractions, abbreviations, diacriticals, and punctuation creates additional challenges. Development of the techniques to analyze features from a handwritten line of characters is a significant challenge for computer scientists and has resulted in a number of techniques to extract, sequence, cluster, categorize, and compare features to attempt to recognize and assign meaning to a given character. When presented with highly irregular characters, the accuracy rates for OCR plunge dramatically, even with human intervention to correct and train the recognition software. In one test, the accuracy rate for handwritten materials reached 85-90% for closed tests, but in open tests where “trained” OCR programs were used to recognize new text, accuracy rates dropped to a range of 60-75% [THEERA02].

Figure 3. Example of ink spread and bleed through (L), and staining and surface contamination (R).

Online Process

A number of techniques have been developed to recognize handwritten characters as they are created, including Xerox Unistrokes and the recognition programs used on tablet personal computers and on Personal Digital Assistants (PDAs) running operating systems such as Palm and Windows CE. These systems rely heavily on what are called “online” techniques, which use information about the direction of the stroke as the character is created and assume a standard line width to predict the character that has been written. These programs can be progressively trained to improve the accuracy of identifying the character style of a single user and can add levels of predictive modeling to try to prompt letterforms within words.

Recognizing complex letterforms composed of many interrelated strokes, such as those used in pictographic languages such as Chinese and Japanese, shares much with recognizing historic handwritten manuscripts, particularly those with highly variable characters, multiple authors, annotations, or that contain abbreviations and rubrics in addition to letterforms. Researchers at Fujitsu have developed a process that combines an offline recognition component that analyzes a bitmap of the input pattern with the online stroke data to improve accuracy. This hybrid character recognition approach compares both on- and offline results to a reference database to predict the character. In one study, the resulting accuracy was greatly improved over other techniques tested, averaging 91.4% for Japanese and alphanumeric characters [TANAK04].
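The idea of combining online and offline results can be illustrated with a very small sketch. It is not the Fujitsu method, whose details are not given here. It simply assumes that each recognizer returns a score per candidate character and merges the two with a weighted sum. The function name `fuse_scores` and the `online_weight` parameter are hypothetical.

```python
from typing import Dict


def fuse_scores(online: Dict[str, float], offline: Dict[str, float],
                online_weight: float = 0.5) -> str:
    """Return the candidate character with the highest weighted combined score.

    Assumption: both recognizers report scores on a comparable 0..1 scale,
    and a candidate missing from one recognizer scores 0.0 there.
    """
    candidates = sorted(set(online) | set(offline))  # sorted so ties resolve predictably

    def combined(ch: str) -> float:
        return (online_weight * online.get(ch, 0.0)
                + (1.0 - online_weight) * offline.get(ch, 0.0))

    return max(candidates, key=combined)


# The stroke data favors "a", the bitmap favors "o"; equal weighting picks "o".
best = fuse_scores({"a": 0.6, "o": 0.4}, {"a": 0.3, "o": 0.7})
```

A weighted sum is only one plausible way to merge the two sources; a real system might instead compare both results against a reference database, as the text describes.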

Offline Process

Unfortunately, recognition of historic manuscripts cannot incorporate real-time stroke information and must rely on offline techniques based on image analysis alone. Image-based identification, such as recognizing characters from scanned image files, relies on variables such as scan resolution, shape, and line intensity. Scan data typically targets a minimum of 300 dpi and a bit depth of 24, with greater resolution and depth preferred when possible.

Though approaches may vary, the process of digitizing handwritten documents typically involves most of the following steps:

  1. Scanning or photography used to create an image of the manuscript. The quality of the document image (sharpness, resolution, tonal range, etc.) is critical to future efforts. Artifacts of “lossy” compression, such as JPEG, add noise and can be particularly damaging to efforts to automate recognition.
  2. Pre-processing using techniques such as “thresholding” to remove background “noise” from the page surface, staining, or other non-information bearing image data.
  3. Alignment to orient the overall page image in preparation for subsequent processing. Alignment is also used to “de-skew” the characters using techniques to orient and provide a similar frame of reference for all of the characters on a given page.
  4. Segmentation to separate the document into logical regions for subsequent processing. Segmentation can also extend to grouping “primitives” that compose the regions, such as words or characters.
  5. Processing of the segments to assign meaning to words or characters that is typically augmented by human proofing and correction.
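Two of the steps above, thresholding and segmentation, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure used by any of the projects mentioned: the image is a plain list of grayscale rows (0 = black ink, 255 = blank page), the cutoff is a fixed value, and text lines are found by looking for runs of rows that contain ink. The names `threshold` and `segment_lines` are hypothetical.

```python
from typing import List, Tuple

Gray = List[List[int]]  # grayscale rows; 0 = black ink, 255 = blank page


def threshold(image: Gray, cutoff: int = 128) -> List[List[int]]:
    """Binarize the image: 1 where a pixel is darker than cutoff (ink), else 0."""
    return [[1 if px < cutoff else 0 for px in row] for row in image]


def segment_lines(binary: List[List[int]], min_ink: int = 1) -> List[Tuple[int, int]]:
    """Return (start_row, end_row) pairs, end exclusive, for runs of rows with ink."""
    lines = []
    start = None
    for y, row in enumerate(binary):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = y                      # a text line begins
        elif not has_ink and start is not None:
            lines.append((start, y))       # a blank row ends the line
            start = None
    if start is not None:
        lines.append((start, len(binary)))  # line runs to the bottom edge
    return lines


page = [
    [255, 255, 255],
    [255, 10, 255],
    [200, 30, 255],   # the light stain (200) is dropped by the threshold
    [255, 255, 255],
    [90, 255, 255],
]
text_lines = segment_lines(threshold(page))
```

A fixed cutoff is the simplest possible choice. Stained or unevenly lit manuscripts would normally need an adaptive threshold, and overlapping lines of script would defeat the row-based segmentation shown here.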

Figure 4. Original image of manuscript (L). Version of the document using thresholding preprocessing to enhance characters (R).


Figure 5. Example of segmented document (L) and segmented lines of text (R).

Higher Dimensional Analytic Tools

One can think of a hierarchy of analytic tools: from the 2-dimensional analysis of offline approaches to a multidimensional analysis (2-dimensional image plus time) in the online techniques. The additional temporal information provides a context that can enhance analysis, prediction, and accuracy.

The computer science researchers at the Partnership for Research in Spatial Modeling (PRISM) at Arizona State University have devised a process to convert the 2-dimensional (2D) script or two-dimensional image/bitmap into three-dimensional (3D) data. This technique permits application of spatial analytic tools to the problem of recognizing highly variable handwritten characters, and was an extension of the 3D modeling and visualization research at PRISM.

This application was conceived as a potential tool to provide enhanced information about complex and overlapping characters in historic 16th- to 18th-century Spanish manuscripts. These documents were written in historic Spanish with few breaks between words and include many complicating factors such as multiple authors, multiple entries over time, many overlapping lines and characters, and extensive use of rubrics and abbreviations, in addition to physical deterioration and staining of the manuscripts themselves. Even groups of highly trained expert human translators frequently fail to achieve 100% translation accuracy due to these complexities. For example, the team of paleographers at the Spanish Colonial Research Center of the United States Park Service in the Zimmerman Library of the University of New Mexico currently works on these and similar documents and is able to transcribe an average of 1000 to 1200 pages per year.

The 3D character recognition approach uses an understanding of direction of the written line, density, and other cues to extract information similar to the temporal information about the lines that compose characters provided by the real-time online techniques. The 3D method separates the lines and loops that define the characters, converts them into chain codes, and creates a surface model that identifies the volume of the ink that defines a given character. This use of the derived chain codes and volume models provides valuable information that can be applied to the task of separating and correctly identifying the lines that describe a single unit or character [RAZDAN03]. Though not sufficient for full OCR, this technique can provide valuable new information to assist paleographers in analyzing and identifying complex characters and separating overlapping lines during translation.

Figure 6. Representation of the 2D bitmap of line scan.
Figure 7. Representation of the conversion process from 2D bitmap to 3D voxel (volume) data.
Figure 8. Representation of the extraction of 3D surface (triangle mesh) from volumetric data.

The process of conversion and extraction involves three conceptual steps:

    1. convert the 2D image into 3D volumetric data
    2. apply filtering to smooth the voxel data
    3. use an extraction program to extract the voxel (volume) surface that represents the character.
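The three conceptual steps can be mimicked on a toy scale. The sketch below is not the PRISM implementation. It assumes that pixel darkness is a usable proxy for ink volume, smooths the voxels with a simple neighbor average, and stands in for true surface extraction (which would typically produce a triangle mesh) by reporting a height per pixel. All function names and the `depth` and `iso` parameters are hypothetical.

```python
from typing import List

Gray = List[List[int]]            # 0 = black ink, 255 = blank page
Volume = List[List[List[float]]]  # indexed as volume[z][y][x]


def to_volume(image: Gray, depth: int = 4) -> Volume:
    """Step 1: stack `depth` layers; darker pixels fill more layers (assumed proxy)."""
    volume = []
    for z in range(depth):
        layer = []
        for row in image:
            # (255 - px) * depth / 255 is the number of layers this pixel should fill
            layer.append([1.0 if (255 - px) * depth / 255 > z else 0.0 for px in row])
        volume.append(layer)
    return volume


def smooth(volume: Volume) -> Volume:
    """Step 2: replace each voxel by the mean of itself and its in-bounds face neighbors."""
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    offsets = [(0, 0, 0), (1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    out = [[[0.0] * width for _ in range(height)] for _ in range(depth)]
    for z in range(depth):
        for y in range(height):
            for x in range(width):
                vals = [volume[z + dz][y + dy][x + dx]
                        for dz, dy, dx in offsets
                        if 0 <= z + dz < depth and 0 <= y + dy < height and 0 <= x + dx < width]
                out[z][y][x] = sum(vals) / len(vals)
    return out


def surface_heights(volume: Volume, iso: float = 0.5) -> List[List[int]]:
    """Step 3 (simplified): per pixel, count the layers whose density reaches `iso`."""
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    return [[sum(1 for z in range(depth) if volume[z][y][x] >= iso) for x in range(width)]
            for y in range(height)]


stroke = [[255, 0, 255]]  # one dark pixel between two blank ones
heights = surface_heights(smooth(to_volume(stroke)))
```

The height map is a deliberately crude substitute for the extracted surface, but it shows how overlapping strokes, which deposit more ink, could appear as raised regions in the derived model.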

An example is the loop in the letter “ l ” in Figure 9. The challenge is to create a model that approximates the character with a set of linear segments parameterized according to the movement of the pen in the same sequence as originally created.


Figure 9. Document bitmap (L) and detail of individual character (R).

Using assumptions of writing from left to right for this example, the pen moved right and up and then down over the previous line to end on the right. The resulting chain code is shown in Figure 10. The red arrows show the first half and the green arrows show the latter half of the line that describes the letter “ l ” .

Figure 10. Representation of the chain code indicating line continuity from left to right.
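A chain code records a line as a sequence of small directional steps. The sketch below uses the common eight-direction (Freeman) numbering as an assumption, since the article does not say which encoding PRISM uses: 0 points right and the codes increase counterclockwise, with y increasing upward. The example path is an invented loop that moves right and up, then comes down across its own earlier line and ends to the right.

```python
from typing import List, Tuple

Point = Tuple[int, int]  # (x, y) with y increasing upward

# Eight-direction codes: 0 = right, counting counterclockwise.
DIRECTIONS = {
    (1, 0): 0, (1, 1): 1, (0, 1): 2, (-1, 1): 3,
    (-1, 0): 4, (-1, -1): 5, (0, -1): 6, (1, -1): 7,
}


def chain_code(path: List[Point]) -> List[int]:
    """Encode an ordered, 8-connected pixel path as a list of direction codes."""
    codes = []
    for (x0, y0), (x1, y1) in zip(path, path[1:]):
        step = (x1 - x0, y1 - y0)
        if step not in DIRECTIONS:
            raise ValueError(f"points {(x0, y0)} and {(x1, y1)} are not adjacent")
        codes.append(DIRECTIONS[step])
    return codes


# Up and to the right, over the top, then down through (1, 1) again and out to the right.
loop = [(0, 0), (1, 1), (2, 2), (2, 3), (1, 3), (1, 2), (1, 1), (2, 0), (3, 0)]
codes = chain_code(loop)  # [1, 1, 2, 4, 6, 6, 7, 0]
```

Because the path is stored in stroke order, the point where the code revisits (1, 1) marks the crossing, which is the kind of sequence information a flat bitmap does not carry.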

Using the techniques being developed at PRISM, it is possible to extract the chain code information from the volume model, filter the data, and then create a 3D surface model that exposes the temporal sequence of the creation of the original line [RAZDAN03]. The surface rendering in Figure 12 clearly shows the overlap of the lines that define the letter “X”.

Figure 12. Conversion from bitmap through volumization to processed surface rendering of the lines defining the character.
Figure 13. (From bottom to top) Bitmap of character, volumized surface rendering, derived crestline representing line sequence.

These techniques are particularly valuable when used to enhance recognition of groups of interrelated characters or to separate multiple overlapping lines. Figure 14 shows an example of modeled data extracted from the bitmap image of adjacent and interconnected characters. These techniques can be applied to larger groups of characters such as lines of text or larger document segments.

Figure 14. Detail of document scan bitmap (L) and 3D surface model extracted from bitmap showing the line overlap that created the characters.

Objective Measurement

One of the greatest difficulties faced by paleographers when deciphering complex characters and rubrics is the subjective nature of interpretation. The use of derived modeling permits application of objective measurement tools that can be used to provide comparisons with known references, and over time, to build comparative reference collections within and among documents. The 3D domain has many tools developed for comparative analysis to extract and compare lines and edges using curvature. By using 3D techniques, the curvature and topology of the characters can be rotated, translated, and compared to determine and quantify the similarities and differences. These tools derive from disciplines such as medical imaging, manufacturing, and forensic science and are an active area of research and development. They offer the benefits of objectively comparing the shape of lines and characters in ways far more powerful than those used for traditional 2D image processing.

Figure 15. Example of curvature comparison (L) and objective comparisons of letterforms using curvature analysis (R).
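One simple way to see why curvature supports objective comparison is to describe a stroke by the angle it turns at each point. That description does not change when the stroke is moved or rotated. The sketch below works on 2D polylines for brevity and is only an illustration of the principle, not the 3D tools described above. It also assumes the two strokes have been sampled with the same number of points.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def turning_angles(points: List[Point]) -> List[float]:
    """Signed turning angle (radians) at each interior point of a polyline."""
    angles = []
    for (x0, y0), (x1, y1), (x2, y2) in zip(points, points[1:], points[2:]):
        before = math.atan2(y1 - y0, x1 - x0)
        after = math.atan2(y2 - y1, x2 - x1)
        # wrap the difference into the range [-pi, pi)
        angles.append((after - before + math.pi) % (2 * math.pi) - math.pi)
    return angles


def curvature_distance(a: List[Point], b: List[Point]) -> float:
    """Mean absolute difference between the turning angles of two strokes."""
    ta, tb = turning_angles(a), turning_angles(b)
    if len(ta) != len(tb) or not ta:
        raise ValueError("strokes need the same number of points (at least three)")
    return sum(abs(p - q) for p, q in zip(ta, tb)) / len(ta)


stroke = [(0, 0), (1, 0), (2, 1), (2, 2)]
# The same stroke rotated by 90 degrees and shifted by (5, 5).
moved = [(-y + 5, x + 5) for x, y in stroke]
straight = [(0, 0), (1, 0), (2, 0), (3, 0)]

same = curvature_distance(stroke, moved)        # close to 0
different = curvature_distance(stroke, straight)  # close to pi / 4
```

A small distance suggests two letterforms bend in the same way regardless of where or at what slant they were written, which gives paleographers a number to compare instead of a visual impression.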

Conclusion

Handwriting analysis using 3D techniques was developed as an initial step to augment the work of human paleographers. Unfortunately, paleographers today have few clues to aid them in extracting stroke information from handwritten characters and thus rely solely on character mimicking, visual graphic analysis, and other subjective techniques. Developing tools to add stroke information and objective comparative analysis will dramatically improve transcription and the translation and interpretation that follow. As these tools develop and are applied to collections of documents, they will enable paleographers to identify and recognize individual styles and build connections of authorship across groups of manuscripts or among collections. The envisioned process flow, outlined in Figure 16, begins with the scanned image. Scanning and preprocessing identifies meaningful units within the document, then models them using the techniques described above. Once the models have been completed, human paleographers use this information to agree on character identification and sequence, and transcribe, translate, and interpret the information within that component of the manuscript. The components are assembled to build the single manuscript page, then group of pages, etc. This information can then be cataloged, encoded, and made available to researchers.

Figure 16. Representative flow chart of computer assisted human text recognition/transcription process.

As these techniques are refined and information about groups of manuscripts begins to grow, hierarchical models of structural clues about document meaning can be developed, as well as tools to identify and label these segments and their component characters. As manuscripts are transcribed, “truthed,” and translated by human experts, machine learning, feature extraction, and training classifier components can be developed to provide semi-automatic or automatic transcription. Tools can also be developed to recognize patterns and information within these segments, to derive information such as style, similarity, and relationships between content on other pages (adjacent or dispersed), and to reconstruct manuscript content from partial elements such as missing or damaged components or clues such as bleed-through or transfer from adjacent pages.

One hopes that the application of new and more powerful tools and techniques to the problem of transcribing and translating historic manuscripts will improve the speed and accuracy of these efforts, and increase the scope, quality, and availability of these resources for research and scholarship.

The Partnership for Research in Spatial Modeling (PRISM) is the modeling and visualization research center in the Institute for Computer and Information Science and Engineering at Arizona State University. PRISM is the focal point at Arizona State University for interdisciplinary research in modeling and visualization to permit intelligent analysis and create spatial and dynamic knowledge.

The PRISM Handwriting Project Team
Dr. Anshuman Razdan razdan@asu.edu
John Femiani John.Femiani@asu.edu
Dr. Jeremy Rowe jeremy.rowe@asu.edu

Additional References

[RAZDAN03] Razdan, A., Femiani, J., Rowe, J. 3D Methods to Aid Handwriting Analysis and OCR, Proceedings of the 2003 Symposium on Document Image Understanding Technology (SDIUT) Greenbelt, Maryland, April 9-11, 2003, Univ. of Maryland College Park, pp 287. 

[RAZDAN04] Razdan, A., Femiani, J., Rowe, J. 3D Techniques for Analyzing Handwritten Manuscripts for Digital Libraries, invited paper, International Conference on Digital Libraries, New Delhi, February 24-27, 2004 proceedings p430-436. Presentation - http://www.teriin.org/events/icdl/presentation/day4/ar.ppt

[TANAK04] Tanaka, H., Iwayama, N., Akiyama, K. Online Handwriting Recognition Technology and its Applications, Fujitsu Sci. Tech. Journal, 40, 1, p. 170-178 (June 2004).

[THEERA02] Theeramunkong, T., Wongtapan, C., Sinthupinyo, S. Offline Isolated Handwritten Thai OCR Using Island-Based Projection with N-Gram Models and Hidden Markov Models. International Conference Of Asian Digital Libraries (ICADL) 2002: 340-351


 Highlighted Web Site

The Ten Thousand Year Blog



http://www.davidmattison.ca/wordpress/

In the earliest days of Weblogging, many blogs served as a place where authors could note and comment on interesting finds on the World Wide Web. David Mattison does quite a bit of this in his Ten Thousand Year Blog. In his Blogwise directory profile, he describes his blog as “thoughts about and pointers to the world of digital preservation, digital libraries, e-learning, science, and history by an archivist.” In addition to providing links to news and resources, the site provides commentary on current trends in online information preservation and management. Both email and RSS (Really Simple Syndication) subscriptions are available for more streamlined and automatic access to the Ten Thousand Year Blog’s updates.

Blog entries are search-enabled and cross-categorized under several descriptions, including:

Web Jots
Electronic Records
Digital Preservation
Blogging Experience
Cool Tools
Information Knowledgists
Digital Dark Age Funnies
Distance Education and E-Learning
Essential Readings
History Findings
Syndication Formats and Reading
Collaborative Web
Digital Libraries and Collections
Visualization Systems
Intellectual Property Rights
Search and Retrieval Technology


 FAQ

Blog Today, Gone Tomorrow? Preservation of Weblogs

Author: Richard Entlich - Cornell University (rge1@cornell.edu)

Weblogs seem to be growing in number and stature, but a lot of them seem pretty ephemeral. Are any special efforts being made to preserve their contents?

Background

Weblogs, or blogs for short, are clearly an ascendant part of the Internet. Although the personal diary goes back pretty much to the earliest days of the Web, its formalization as an online activity dates roughly from 1999. In that year, the first blog portals and tools to simplify and automate blogging activities appeared. It's been pretty much up and up ever since.

There are a number of definitions around as to what constitutes a blog, though most agree that a baseline description includes postings (at varying intervals), usually by a single individual, in the form of text, images, and other data forms, arranged in reverse chronological order and accessible with a Web browser. A classic blog entry includes a link to another item on the Web along with commentary on it, but some consist solely of the author's original writing, drawing, photos, music, or other creative forms [i]. Blogs can be private or public, frequently or infrequently updated, solicitous of comments or not, and cover topics from news and politics to technology, art, religion, culture, and everything in-between.

How big is blogging?

Although the numbers are hard to pin down, most estimates place the number of active blogs at around 2 million [ii]. The total number of blogs created is much higher and is increasing rapidly, with one forecast that the number of hosted blogs on the major services will exceed 10 million by the end of 2004. Besides the sheer growth in popularity and participation, there is also a burgeoning metablog infrastructure, including blog-specific portals, search engines, popularity ranking services, directories, and census services.

How ephemeral are blogs?

Like any predominantly spare time activity, blogging has to get squeezed into its practitioners' busy schedules. Not surprisingly, the enthusiasm with which a new blogger often greets the activity wanes as the task of making regular updates begins to drag. According to the Perseus Blog Survey, released in October 2003, about 2/3 of over 4 million blogs found on eight popular blog hosting services may have been abandoned, i.e., not updated within the past two months. Over a million consisted of just an initial post. The average active blog was updated about every two weeks.

The demands of blogging can seem especially onerous for authors of popular blogs who develop a large following. Expectations of frequent updates can become difficult to ignore, and burnout has been seen, even among the most committed bloggers [iii]. It's not unusual for a blog's final post to come without any prior warning, commonly accompanied by explanations that invoke a need to spend more time with family or a desire to move on to other projects. Sometimes, there is no explanation, just a sudden silence.

Recently, the author of a popular blog called The Invisible Adjunct, which is about "the use and abuse of adjunct faculty," announced plans to leave academia and shut down her blog. The blog had struck a chord with many and become a community of commiseration. Its readers felt a profound sense of loss. Many requested reconsideration of the decision to shut down or permission to mirror the site.

There are other ways for blog content potentially to be lost. A majority of bloggers use hosting services such as Blogspot and Typepad to store the contents of their blog. Although blog software and hosting services usually have a built-in archiving function, a look at the terms of service may reveal language like "[we] assume no responsibility for the deletion or failure to store information entered into [our site]" and "you are responsible for maintaining and backing-up your data and information that may reside on the service." Typically, the services also disclaim responsibility for losses that occur if they terminate their agreement with the author for any reason, or if the entire service shuts down.

Just a few months ago (June 2004), a free blog hosting service called Weblogs.com closed down without warning, leaving over three thousand users without access to their blogs. Though a crisis was averted when all the blogs were transferred to another service, the impact of the prospect of lost blogs was illustrative. Some users of the service were extremely angry that they weren't forewarned and given a chance to back up their blogs, with one proclaiming "My entire life is in that blog."

Clearly there are many ways for blog contents to be lost, and both authors and readers are concerned.

How important are blogs?

Blogging is beginning to break out beyond its passionate core of early writers and readers. Bloggers have been credited with political clout (influencing Trent Lott's resignation as Senate Majority Leader; fueling enthusiasm for Howard Dean's presidential bid) and many are receiving credentials at the national party nominating conventions for the first time this year. The "other" media have certainly taken note, but seem uncertain whether to treat blogging as a cultural phenomenon or as competition.

The typical blogger has a day job and blogs as a labor of love, though some of the most popular and prolific bloggers ask for contributions to cover Web hosting costs. As befits a part-time activity, most blogs have acted as secondary sources, commenting on and perhaps providing a new angle on primary source material published elsewhere. Nevertheless, some have major followings, receiving thousands or tens of thousands of visits each day. In February, Wired magazine heralded Instapundit.com's Glenn Reynolds as the most popular blogger. His site receives 100,000 visitors each day, rivaling the viewership of some cable news programs.

Google, which last year purchased blogging software pioneer Pyra Labs' Blogger, estimates that Blogger accounts for 5-6% of its total traffic. Jupiter Media estimates that 4% of online users read blogs. Also, a recent survey by the Pew Research Center for the People and the Press on where Americans get their news found a growing reliance on online sources. Though the study wasn't sufficiently fine-grained to measure blog usage, one can be confident that some of that growth is from blogs. Blogs are also serving as important current awareness tools in many professions, including librarianship, which boasts hundreds.

Clearly, some blogs are very popular and a small percentage of total online users rely on blogs for guidance on what stories to pay attention to and what lens to view them through. However, the vast majority are only seen by a few friends and relations of the author.

Further complicating any effort to assess the importance of blog content is cooptation. The blogging form, widely valued for its spontaneity, shameless partiality, and unfiltered and unedited freshness is being adopted by mainstream media. Outlets like the New York Times and Columbia Journalism Review are starting their own blogs, yet neither they nor traditional bloggers seem anxious to acknowledge a connection. Professional journalists associate the term blog with an absence of standards, while the blogging community sees the use of edited, paid staff as antithetical to the essence of blogging.

The blogs within mainstream publications will likely be deemed important simply because they are part of important publications and archived because of that association. That still leaves the bulk of the millions of traditional blogs with an uncertain claim to be singled out for long-term preservation.

How hard are blogs to archive?

Since blogs are part of the Web, archiving them shares all of the challenges faced by Web archiving in general. These include copyright, robot exclusion, dynamic content, password protection, exotic file formats, and miscoded material. Blogs also offer some unique challenges.

There are obstacles to finding and identifying blogs, especially those that exist outside the popular blog hosting services. Also, blogging packages offer a variety of features and functions. Many allow reader comments and trackbacks (references to their content from other blogs). On some sites comments are maintained only for a short period, then deleted. On others, the comments and trackbacks are handled by a separate service and stored on an entirely different machine (e.g., Haloscan or TypeKey) [iv].

Since reader commentary is sometimes integral to understanding the impact and context of a blog, preserving it would, in some cases, require copying content at very regular intervals. In fact, some in the blogging world believe that the comments form the essence of what readers find valuable in a blog and that commentators acquire a kind of "contributor" status on the blogs they frequent, giving them a say in the fate of the blog's contents.

It is not clear what constitutes full and complete capture of a blog. Beyond the internal components, there is also the issue of the links. Blogs frequently contain commentary about content elsewhere on the Web. Link rot hampers the integrity of the Web in general, but since links are an integral part of blogs, link rot can be expected to have an especially insidious impact on the intelligibility of blog archives. Commentary on linked objects that no longer exist or that may not even be identifiable could be akin to virtual Swiss cheese—so full of holes it's unintelligible.

(For another take on blog archiving, see the Cliff Lynch interview elsewhere in this issue.)

What is the status of current blog archiving activities?

Most librarians and archivists have not yet identified blogs as online resources particularly meriting collection and preservation. This is hardly surprising. Web archiving activities in general are still in their infancy, and most of the attention is being paid to Web publications that have characteristics in common with more traditional published material, i.e., ISBNs, ISSNs, regular publishing cycles, and an emphasis on academics.

Of course, there is Web archiving happening, and some bloggers are relying on it. The Invisible Adjunct responded to reader concerns about loss of her blog by stating "the site should eventually be archived at the Internet Archive." That may well be true, but since the Wayback Machine interface to the Internet Archive is anywhere from six months to a year out of date, it's hard to verify.

Nevertheless, the Internet Archive is one of the few Web archiving initiatives at least attempting to perform comprehensive sweeps of Web content, though its sporadic crawls could still miss plenty from a regularly updated and heavily commented blog. Most other Web archiving of any significance is being conducted by national libraries and has a considerably narrower scope. Typically, national Web archiving is conducted according to national deposit laws, many of which were revised fairly early in the digital era to accommodate some network-accessible material, but too early to specifically mention blogs.

One exception is the National Library of Australia's Online Australian Publications: Selection Guidelines for Archiving and Preservation, updated in 2003. It lists blogs under "categories [that] will generally not be collected ... except those that support the academic publications category." Canada's guidelines don't mention blogs, but make it clear that collecting of "public communications" will be done on a very selective basis.

Even in the analog world, libraries and archives have collected ephemeral materials on a limited basis, and only when the items in question fall within an existing collection scope, be it topical, geographical, temporal, or biographical. Collecting of digital ephemera is even sparser because of its sheer volume and the difficulties in selecting and harvesting it.

Good analogies to print materials can help make the case for saving digital ephemera. E-mail is pretty clearly the heir to letter writing, but what about blogs? Are they the equivalent of diaries, journals, datebooks, letters, newsletters, posters, pamphlets, or perhaps all of them? Ultimately, they are a new and still evolving form that has not been the target of any focused collection strategy.

Should blogs be archived?

It's pretty easy to make a case for selective archiving of blogs. They represent a recognizably distinct form of communication that is having a measurable impact on human affairs in the early 21st century. By traditional collection criteria, blogs with the greatest impact should get priority. Impact might be determined by readership or inbound link counts. But a broad sample of well-tended blogs of all sorts would be needed to provide a flavor of blogs as a social and cultural phenomenon and help historians of the future understand our era. Unfortunately, targeted collection of blogs, as well as most other digital ephemera, is not yet getting much attention from librarians and archivists.

Thus, for the most part, it's up to individual bloggers to maintain copies of their creations or else rely on the Internet Archive to do the job. Technology is providing the prospect for individuals to make more and more detailed digital records of their lives. Projects like Sunil Vemuri's "What was I Thinking" and Gordon Bell's "MyLifeBits" aim to capture ever more ephemeral and less structured communication such as in-person conversations and phone calls.

However, it's probably unwise to rely upon finding a MyLifeBits recorder with the personal effects of a noteworthy (or ordinary) individual of the future to gain some insight into their lives. For now, and for the sake of the people whose "whole life is in that blog," there is a growing need to develop a strategy to save at least a few of those lives for posterity.

[i] For those unfamiliar with the form, see the components of a typical blog.

[ii] Methods for identifying, counting, and characterizing blog activity vary. See blogcount and NITLE Blog Census for background.

[iii] As further evidence of blogging's impact, reports of obsessive blogging and blogging's effect on relationships are starting to appear in the mainstream media.

[iv] This is a fairly recent phenomenon, initiated to provide bloggers better control over comments and deal with automated spam.


 Calendar of Events





Digital Resources for the Humanities Conference 2004
September 5-8, 2004
Newcastle upon Tyne, UK
Part of an annual series, this conference addresses key emerging themes and strategic issues in humanities computing. Major themes this year include: methods in humanities computing; cross-sector exchange between heritage, national and local government, and education bodies; broadening the humanities computing base; and new forms of scholarly publication.

Summer School on Digital Library Technologies
September 6-10, 2004
Pisa, Italy
The DELOS Network of Excellence will sponsor its Third International Summer School on Digital Library Technologies (ISDL 2004) with a focus on "User-Centered Design of Digital Libraries." This weeklong intensive course includes lectures and small group discussions.

Business Models related to Digital Preservation
September 20-22, 2004
Amsterdam, the Netherlands
Building on business models, this three day workshop will focus on issues of establishing solid organizational infrastructure for digital preservation programs.

The American Museums Digital Imaging Survey Benchmarking Conference—Direct Digital Image Capture of Cultural Heritage in American Institutions
September 21-22, 2004
Rochester, New York
The Direct Digital Image Capture of Cultural Heritage Research Program will sponsor this conference to present and discuss the results from their Digital Imaging Survey and case studies. Invited speakers will “cover topics of vital interest to institutional photography departments that have switched to direct digital capture or those that are considering it.”

Ensuring the Long-Term Preservation and Adding Value to the Scientific and Technical Data
October 5-7, 2004
Frascati, Italy
Hosted by the European Space Agency, the theme of this symposium will be "From Preservation to Access." Main topics include: technology and standards, added-value services, users expectations, lessons learned, and future prospects.

LITA 2004 National Forum
October 7-10, 2004
St. Louis, Missouri
The Library & Information Technology Association of the ALA will sponsor this annual three-day educational event for those interested in library and information technology. The Forum will include preconferences, general sessions, and more than 30 concurrent sessions. Enrollment is limited.

DC 2004 - International Conference on Dublin Core and Metadata Applications
October 11-14, 2004
Shanghai, China
Online registration and a preliminary program for DC 2004 are now available. The theme of this year’s conference is “Metadata Across Languages and Cultures.”

Archiving Web Resources: Issues for Cultural Heritage Institutions - International Conference
November 9-11, 2004
Canberra, Australia
National Library of Australia
The purpose of this limited enrollment conference is to review key issues faced by cultural heritage institutions in ensuring long-term access to Web resources.

ICADL 2004 International Collaboration & Cross-Fertilization
December 13-17, 2004
Shanghai, PR China
The theme of the 2004 International Conference of Asian Digital Libraries (ICADL) is “Digital Library: International Collaboration and Cross-Fertilization,” with a focus on technology, service and management, and collaboration and localization.


 Announcements





A new PRONOM
The National Archives of the UK has announced that a new version of PRONOM is now online. PRONOM is a searchable Web-based file format registry intended to facilitate long-term preservation of electronic records.

Award given to the National Archives, UK for “Digital Fridge”
The National Archives of the UK was awarded the Pilgrim Trust Preservation Award for their Digital Archive, described as a "giant fridge" for electronic documents.

File Formats for Preservation presentations available
Presentation materials, including audio streams, from the ERPANET workshop entitled “File Formats for Preservation” held in May 2004 are now available via the ERPANET website.

Registry of Digital Masters Record Creation Guidelines
Created by the DLF and OCLC, the Registry of Digital Masters Guidelines describe metadata implementation based on the MARC 21 Format for Bibliographic Data elements and OCLC cataloging system functionality for entries into the DLF Registry of Digital Masters project.

Digital Preservation Research Grants Initiative
The Library of Congress has partnered with the National Science Foundation to establish a funding program to address long-term preservation of digital materials. The NSF's Directorate for Computer and Information Science and Engineering has issued a call for proposals: http://www.cise.nsf.gov/funding/pgm_display.cfm?pub_id=13106&div=iis

Digitization of California historical newspapers microfilm study
The California Preservation Program is conducting a project to determine the feasibility and scope of digitizing California historical newspapers microfilm. Results from the first phase of the study are provided on their website, which includes links to the comparison scanning service providers that digitized and provided content management for the newspaper content.

NEH Division of Preservation request for proposals
NEH has posted a request for proposals for test bed development for the National Digital Newspaper Program.

UK Web Archiving Consortium
The UK Web Archiving Consortium was launched in June 2004. The Consortium, composed of six UK institutions, aims to “expand the lifespan of website materials from around 44 days (the same life expectancy as a housefly) to a century or more,” thus ensuring access to selected valuable resources for future generations.


 Publishing Information





RLG DigiNews (ISSN 1093-5371) is a Web-based newsletter conceived by the RLG preservation community and developed to serve a broad readership around the world. It is produced by staff in the Department of Research, Cornell University Library, in consultation with RLG and is published six times a year at www.rlg.org.

Materials in RLG DigiNews are subject to copyright and other proprietary rights. Permission is hereby given to use material found here for research purposes or private study. When citing RLG DigiNews, include the article title and author referenced plus "RLG DigiNews." Any uses other than for research or private study require written permission from RLG and/or the author of the article. To receive this, and prior to using RLG DigiNews contents in any presentations or materials you share with others, please contact Jennifer Hartzell (jlh@notes.rlg.org), RLG Corporate Communications.

Please send comments and questions about this or other issues to the RLG DigiNews editors.

Co-Editors: Anne R. Kenney and Nancy Y. McGovern; Associate Editor: Robin Dale (RLG); Technical Researcher: Richard Entlich; Contributor: Ellie Buckley; Copy Editor: Martha Crowe; Production: Jenn Demaree, Carla DeMello.


All links in this issue were confirmed accurate as of August 11, 2004.




 