Contents of: Volume 9, Number 3 ISSN 1093-5371
  Feature Article 1: Selective Archiving of Web Resources: A Study of Acquisition Costs at the National Library of Australia  
  Feature Article 2: Bringing the Digital Revolution to Medieval Musicology: The Digital Image Archive of Medieval Music (DIAMM)  
  Highlighted Web Site: PrestoSpace  
  FAQ: A Little Bit'll Do You (In): Checksums to the Rescue  
  Calendar of Events  
  Announcements  
  RLG News: PREMIS Final Report Released; Certification of Digital Archives Project Begun  
  Publishing Information  
Feature Article 1

Selective Archiving of Web Resources: A Study of Acquisition Costs at the National Library of Australia

Author: Margaret E. Phillips - National Library of Australia (mphillips@nla.gov.au)

Introduction

During the past decade a small but growing number of national libraries have established Web archiving programs. These programs have taken one or more of four main approaches:

  • selective archiving, for example, the archives of the National Libraries of Canada, Japan, and Australia [1]
  • periodic harvesting of the country’s entire Web domain, exemplified by the archives of the Nordic countries, including Sweden [2]
  • thematic collecting, exemplified by the Library of Congress’s MINERVA [3] collections of the Elections 2000 and 2002 and of September 11, 2001
  • deposit collections, such as STORS at the State Library of Tasmania and the e-Depot [4] at the National Library of the Netherlands

The National Library of Australia chose the selective approach because of its perceived advantages:

  • Each item in the Archive is assessed for quality and is functional to the fullest extent permitted by current technical capabilities.
  • Each item in the Archive can be fully catalogued and therefore can become part of the national bibliography, and the bibliographic data can be shared. In the Library’s own catalogue, Web resources are integrated with all other resources and users need look in just one place.
  • Each item in the Archive can be made accessible via the Web to readers immediately because permission to do so has been negotiated with publishers.
  • The properties of individual resources and classes of resources within the Archive are known to collection managers. This enhances our ability to develop methods and tools to collect them in the first place, store them, and provide access to them. This knowledge also apprises collection managers of the preservation strategies that will be required to keep the resources accessible for the long term.
  • Sites that are inaccessible to harvesting robots can be identified and gathered using other methods as arranged with the publisher. This includes commercial titles, which may require a publisher-supplied password, and databases.

Despite these advantages, each of the archiving approaches has disadvantages, and the selective approach is no exception. It relies on library staff operating in an environment with very new information to judge what will be required for research in the future. It also takes a resource out of context, breaking its links to external resources. From the point of view of archive development and management, the biggest disadvantage is that the process is labour-intensive and the unit cost of each item collected is high.

Background

The National Library of Australia commenced experimental, selective Web archiving in 1996. At that time it was a very new activity and very little had been written about it. There was no one from whom to learn. We did not know enough about Web archiving to write a plan or strategy and to project its costs. We could only proceed by taking small practical steps and learning as we went along. No funding had been received to undertake this new business, and we had to redirect collection development staff who showed interest in and aptitude for Web archiving away from more traditional library tasks to this new activity. For example, one person in the IT Division spent part of his time considering how these resources might be downloaded from the publishers’ websites and their contents kept on the Library’s server. We were obliged to use freely available software for harvesting and managing the downloaded files. Proceeding in this modest way, our early Web archiving costs were quite low and were largely hidden within existing staff budgets.

In the ensuing nine years, operations became much more sophisticated. The Library established PANDORA, Australia’s Web Archive [5], as an ongoing operational archive, and nine other Australian libraries and cultural collecting agencies became partners, one by one. On the whole, growth was incremental and the institutions absorbed the increasing costs. A greater volume of archiving activity and the need to support partners contributing to the Archive from remote locations demanded a sophisticated technical infrastructure comprising a delivery system, an archive management system, and storage. This meant quite substantial development costs, although these costs were still met through existing staff budgets in the Collection Development and Information Technology Divisions.

The Library could not find a suitable archive management system for purchase and therefore developed the PANDORA Digital Archiving System (PANDAS) [6], along with the PANDORA delivery system, in-house. The Library also purchased a Digital Object Storage System (DOSS), which PANDORA shares with the Library’s other digital collections.

One thing has not changed in the past nine years: the Library still has no additional funding to undertake this activity. In an odd way, this proved advantageous, as it forced us to fund the activity from the ongoing annual budget allocation from the Australian government, and there was no special short-term project funding that, when it came to an end, left the activity unsustainable.

The Pandora Archive Now

The PANDORA Archive is a collection of significant Australian online publications and websites, developed by the National Library and its partners, that is stored, managed, and maintained centrally at the National Library in Canberra. As of April 30, 2005, the Archive contained 8,235 titles, growing at the rate of approximately 2,400 titles per year. These titles may consist of a single file, such as a text document in Portable Document Format (pdf), or they may be complex Web objects, such as a large website, consisting of thousands of files in a variety of formats, including text, sound, image, or video. Many of the titles are re-gathered on a regular basis, creating a new “instance” of the title in the Archive, of which there are now 16,736. In building the Archive, the objective is to copy selected titles into the safety of the Archive and to provide access to them in perpetuity.

The Archive includes both static and dynamic online publications and websites and represents a wide range of publication types and formats used by publishers and creators on the Web. It includes online publications and websites that have now disappeared from the live Web and that are no longer available anywhere else.

Most of the titles in the Archive are freely available to anyone, anywhere in the world with an Internet connection. Access to a very small proportion of the Archive is restricted, usually for five years or less, for commercial reasons. These restricted titles can be consulted on a single PC in the Library’s main reading room.

Access to the contents of the Archive can be obtained either via a hot link in the catalogue record for a particular resource or from subject and title lists on the PANDORA website. Commercial search engines, such as Google and Yahoo!, index the Archive to the title level.

Tasks involved

The tasks that the staff of the National Library and other partners undertake as part of the PANDORA Web archiving program are determined by a number of factors, including policy decisions and local circumstances. These policy decisions and circumstances affect, in no small way, the cost of this program.

PANDORA contributors place a high degree of emphasis on preserving the “look and feel” (appearance and functionality) of a publication or website, as well as its contents, to the greatest extent possible. Once the harvester has copied a resource to a server at the National Library, staff of contributing agencies check this copy for completeness and functionality before consigning it to the Archive for public access. This quality assurance process is very time-consuming and, therefore, expensive. It is, in fact, the most expensive aspect of the acquisition process.

Each title in the Archive is catalogued, with a record in the National Library’s and other partners’ online catalogues as well as in the National Bibliographic Database (a union catalogue of records of over 850 Australian libraries, with access provided by the Kinetica [7] service). This policy decision was made because it was considered important that the discovery of online resources be integrated with discovery of all of the Library’s other collections. It was also considered important that these significant publications, which are part of the national collection, should also be part of the national bibliography. This decision does, however, add to the cost of our Web archiving program.

Australia’s legal deposit laws are another significant contributor to staff task time. At the Commonwealth level and also for most of the States, legal deposit legislation has not yet been extended to include online publications. Only the Northern Territory has recently passed legislation that unambiguously includes online publications in the legal deposit provisions. This means that all other partners, including the National Library, must seek permission from the publisher before copying a resource into the Archive and making it publicly available.

PANDORA staffing

Building the PANDORA Archive, developing the applications on which it depends, maintaining its systems, and establishing expertise and planning for its long-term preservation involve staff from six branches in two divisions of the National Library.

Staff in the Digital Archiving Branch of the Collections Management Division are responsible for selecting and archiving content. The Applications Branch of the Information Technology Division develops and enhances the technical infrastructure. Website Services guides development of the user interfaces for both PANDORA and PANDAS. Business Systems Support maintains the production, testing, training, and evaluation systems and keeps them operating smoothly. Staff in the Preservation Services Branch of the Collections Management Division are responsible for maintaining long-term access to the contents of the Archive.

At the time the costing exercise described below was undertaken in 2004, the National Library’s contribution to the PANDORA Archive in full-time equivalents (FTE) was at the following levels [8]:

Digital Archiving Branch:

1 FTE EL2 (manager - librarian)

2 FTE APS6 (one supervisor and one special projects officer - librarians)

4 FTE APS5 (operational staff - librarians)

0.17 FTE APS5 (technical problem solver – IT background)

Information Technology Division:

0.2 FTE EL2 (two managers who each spent 10 percent of their time on system development and system maintenance)

0.5 FTE APS6 (two staff who each spent 25 percent of their time on system development and system maintenance)

As mentioned above, staff from Preservation Services contribute at the equivalent of one full-time position; however, the cost of preservation was not included in this costing.

The Cost of Acquiring Online Publications and Websites

Although the Library has known that the unit cost of archiving (acquiring) an online publication is high compared to the acquisition of a printed book or serial, until recently we have had no precise information about the cost of Web archiving. In 2004, the Library decided to cost its legal deposit activities for serials and monographs and, since collecting Australian online publications is regarded as an extension of our legal deposit responsibilities, it was decided to cost that activity as well.

Scope of the costing

The costing exercise examined the cost to the National Library of acquiring an “instance” [9] and adding it to the PANDORA Archive. The boundaries of what costs would be included and excluded were defined.

Only the direct costs incurred were included:

  • staff costs for the Digital Archiving Section at the National Library
  • the Digital Archiving Section’s share of administrative costs, such as travel, training, conference attendance, and office supplies (supplier costs)
  • infrastructure development and maintenance costs, such as IT staff and hardware and software purchases

Costs that were excluded were:

  • indirect costs, such as the provision of work stations to staff members
  • lighting and building maintenance
  • cost of preserving the contents of the Archive

Only costs incurred by the National Library were considered; costs that partner agencies assume in employing staff to contribute to the Archive were excluded.

Methodology

In preparation for the costing exercise, staff prepared a detailed flow chart of all tasks and processes (that is, tasks undertaken by the Digital Archiving Section) required to acquire an instance for the Archive. Key cost drivers (activities) were determined. By a consensus process, staff estimated the average time in minutes that is spent per staff member on each driver each day. A working day contains 441 minutes.

  • identification and selection – 30 minutes
  • publisher contact, negotiating permission to archive the title, and filing of correspondence – 30 minutes
  • gathering, quality assurance, and archiving instances – 210 minutes
  • cataloguing – 81 minutes
  • other activities (includes correspondence with indexing and abstracting agencies, reference enquiries based on the Archive, and Digital Archiving staff contribution to the development of PANDAS) – 60 minutes
  • partner liaison and support – 30 minutes (activity not included in this costing)
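The per-driver estimates above can be checked against the stated 441-minute working day. The following is a minimal sketch for illustration only; the dictionary keys are shorthand labels introduced here and are not taken from the Library's spreadsheet.

```python
# Daily time estimates per staff member, in minutes, as listed above.
# The short keys are labels chosen for this sketch only.
DRIVER_MINUTES = {
    "identification_and_selection": 30,
    "publisher_contact_and_permissions": 30,
    "gathering_qa_and_archiving": 210,
    "cataloguing": 81,
    "other_activities": 60,
    "partner_liaison_and_support": 30,  # reported separately, not costed
}

WORKING_DAY_MINUTES = 441  # length of a working day stated in the article

# The listed estimates should account for the whole working day.
total_minutes = sum(DRIVER_MINUTES.values())

# Share of the working day taken by each driver, as a percentage.
driver_share = {
    name: round(100 * minutes / WORKING_DAY_MINUTES, 1)
    for name, minutes in DRIVER_MINUTES.items()
}

if __name__ == "__main__":
    print(f"Total: {total_minutes} minutes")
    for name, share in driver_share.items():
        print(f"{name}: {share}%")
```

On these estimates, gathering, quality assurance, and archiving take just under half of the working day.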

Not all staff undertake all of these tasks, and some staff do more or less of them than others. For instance, the supervisor of the Section also carries out administrative tasks and staff supervision. This means he does less gathering, quality control, archiving, and cataloguing than the others.

Of the tasks listed above, the manager of the Digital Archiving Section undertakes only partner support/liaison and occasional contact with publishers. Most of her time is spent on administration, policy development, publicity, and liaison with other organisations involved in Web archiving. However, her salary is regarded as a necessary component of the Library’s overall digital archiving costs and was included in the overall costing.

Some staff are involved in tasks that were completely out of scope. For instance, two librarians spend time each week on the reference desk in the Reading Room. This time was subtracted from the total time spent on archiving.

As the leading partner in PANDORA and the supplier of the technical infrastructure, the National Library has a support role in relation to all other partners, and this involves additional cost. All Digital Archiving Section staff spend varying amounts of time liaising with and providing technical support to partners. Operational staff spend an average of 30 minutes each per day in partner liaison and support, while the manager spends an average of ten minutes per day. This activity was not included in this costing exercise but was reported separately.

Calculating the costs

In the next stage of the costing exercise, an Excel-based costing methodology was designed to calculate activity costs per instance archived.

The salaries of Digital Archiving staff and daily time spent on each of the drivers by each staff member were entered into the Excel spreadsheet. The total number of minutes spent on each driver and the staff cost of each driver per day were then calculated.

Various supplier costs, as described earlier, were then added. Infrastructure development and maintenance costs were also included at this stage.

A total of 937 instances were archived by the National Library during July – October 2004, an average of 13 per working day.
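The spreadsheet itself is not reproduced in the article, so the following is only a sketch of how such a calculation might be structured. It assumes that each salary is apportioned to drivers in proportion to the minutes spent on them, which the article does not state explicitly. All names and all salary and cost inputs are hypothetical and are not the Library's figures.

```python
from dataclasses import dataclass

WORKING_DAY_MINUTES = 441  # length of a working day stated in the article


@dataclass
class StaffMember:
    daily_salary: float       # AUD per working day (hypothetical values below)
    minutes_per_driver: dict  # driver name -> minutes spent per day


def driver_costs_per_day(staff):
    """Apportion each salary to drivers in proportion to minutes spent.

    This apportionment rule is an assumption made for the sketch.
    """
    totals = {}
    for person in staff:
        rate = person.daily_salary / WORKING_DAY_MINUTES  # AUD per minute
        for driver, minutes in person.minutes_per_driver.items():
            totals[driver] = totals.get(driver, 0.0) + rate * minutes
    return totals


def cost_per_instance(staff, instances_per_day, other_costs_per_day=0.0):
    """Total daily cost (staff plus other costs) per instance archived."""
    staff_cost = sum(driver_costs_per_day(staff).values())
    return (staff_cost + other_costs_per_day) / instances_per_day


# Hypothetical inputs, for illustration only.
example_staff = [
    StaffMember(220.5, {"gathering_qa": 210, "cataloguing": 81}),
    StaffMember(441.0, {"gathering_qa": 100}),
]
example_cost = cost_per_instance(
    example_staff, instances_per_day=13, other_costs_per_day=10.0
)
```

Time spent on out-of-scope work, such as reference desk duty, is simply left out of `minutes_per_driver`, which mirrors the subtraction described in the methodology.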

Acquisition costs

With all of this data entered, the spreadsheet calculated that each archived instance cost AUD$178.68, excluding the activity of partner liaison and support. It also gave us information about the component breakdown of this cost:

  • Digital Archiving staff cost per instance archived - AUD$168.36
  • Supplier costs per instance archived - AUD$3.41
  • Infrastructure development and maintenance costs per archived instance - AUD$6.91

Here the stark reality of the high cost of the labour-intensive, selective approach to Web archiving is apparent. Staff costs comprise 94 per cent of the unit cost.
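The component figures can be checked with a few lines of arithmetic. This sketch uses only the three amounts quoted above; the variable names are introduced here for illustration.

```python
# Cost components per archived instance, in AUD, as reported above.
components = {
    "staff": 168.36,
    "supplier": 3.41,
    "infrastructure": 6.91,
}

# Sum of the components, rounded to cents.
total_per_instance = round(sum(components.values()), 2)

# Each component as a percentage of the unit cost.
component_share = {
    name: round(100 * cost / total_per_instance, 1)
    for name, cost in components.items()
}

if __name__ == "__main__":
    print(f"Total per instance: AUD${total_per_instance:.2f}")
    for name, share in component_share.items():
        print(f"{name}: {share}%")
```

The components sum to the reported AUD$178.68, and the staff share works out to 94.2 per cent, consistent with the 94 per cent quoted.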

Comparison with print

The high cost of acquiring Web publications compared to printed publications was also highlighted by comparing these costing results with similar ones undertaken at the same time for legal deposit printed monographs and serials.

  • Cost of acquiring a legal deposit monograph - AUD$43.77
  • Cost of acquiring a legal deposit serial issue - AUD$11.29

Attempting to compare these costs is a bit like trying to compare the unit cost of transporting watermelons, bananas, and grapes to market. They are not the same in a number of dimensions. An “instance,” the unit of measurement for the Web publication costing, is not the equivalent of either a monograph or a serial issue. Nevertheless, despite considerable differences in the nature of these publications and the processes involved in acquiring them, it is quite clear that the cost of acquiring Web publications is substantially higher than for print publications.

It should be noted here that no purchase costs are involved in these figures. The Library receives its legal deposit printed materials free of charge, and the Web publications are harvested free of charge from the publishers’ websites. These costs relate solely to the acquisition process. Shelf preparation of items was included in the costing. The subsequent costs of binding, stacks management, and collections care were not considered.

Cost of particular activities

The National Library was also interested in analysing the cost of individual drivers (activities). The staff broke down the costs per instance as follows:

  • identification and selection – AUD$10.16
  • publisher contact, negotiating permission to archive the title, and filing of correspondence – AUD$10.34
  • gathering, quality assurance, and archiving of instances – AUD$71.09
  • cataloguing – AUD$27.42
  • other activities (includes most of the manager’s activities, correspondence with indexing and abstracting agencies, reference enquiries based on the Archive, and Digital Archiving staff contribution to the development of PANDAS) – AUD$59.67

Note that this cataloguing cost is not very useful since it is a per instance cost, and resources are catalogued at the title level, not the instance level. Given that each title in the Archive has approximately two instances, a more realistic cataloguing cost is AUD$54.84.
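The activity figures and the per-title cataloguing adjustment can be reproduced as follows. This is an illustrative sketch: the dictionary keys are shorthand introduced here, and the instances-per-title ratio is derived from the whole-Archive counts quoted earlier (8,235 titles and 16,736 instances as of April 30, 2005), which include partner contributions.

```python
# Cost per instance by activity, in AUD, as listed above.
activity_costs = {
    "identification_and_selection": 10.16,
    "publisher_contact_and_permissions": 10.34,
    "gathering_qa_and_archiving": 71.09,
    "cataloguing": 27.42,
    "other_activities": 59.67,
}

# Sum of the activity costs, rounded to cents.
activity_total = round(sum(activity_costs.values()), 2)

# Whole-Archive counts quoted in the article.
TITLES = 8235
INSTANCES = 16736
instances_per_title = INSTANCES / TITLES

# The article rounds the ratio to two instances per title.
cataloguing_per_title = round(activity_costs["cataloguing"] * 2, 2)

if __name__ == "__main__":
    print(f"Activity total: AUD${activity_total:.2f}")
    print(f"Instances per title: {instances_per_title:.2f}")
    print(f"Cataloguing per title: AUD${cataloguing_per_title:.2f}")
```

The five activity costs sum to AUD$178.68, the same as the total unit cost reported above, and the Archive-wide ratio of about 2.03 instances per title supports the "approximately two" used for the AUD$54.84 figure.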

Possibilities for reducing costs

From this activity-level costing information we have been able to consider how we might reduce our costs. We could reduce our costs by changing our approach to Web archiving (changing our policies) or by finding ways to perform the tasks involved more efficiently or both. For the foreseeable future, we will maintain existing policies, for instance, of creating catalogue records for each title and of undertaking rigorous quality assurance. There are other opportunities for savings.

The identification and selection of titles is fundamental to the selective approach to archiving and, at first glance, it is difficult to see how we might reduce time spent on this activity. However, we are working with government publishers to supply metadata for online publications, which will be batch loaded into PANDAS for automatic harvesting of the described publications. In effect, the publishers will be identifying what will be archived in PANDORA by supplying the metadata for it. PANDAS has yet to be enhanced to enable this batch harvesting and processing to take place, but we anticipate that once it is operational, the average cost of acquiring the publications of participating agencies will be significantly reduced. The metadata being supplied by a small number of publishers is already being automatically converted to MARC records for inclusion in the National Bibliographic Database and will be downloaded to the Library’s online catalogue. This will help to reduce cataloguing costs.

Further calculations on the components of the most expensive driver (gathering, quality assurance, and archiving of instances) revealed that quality assurance comprised 86 percent of the cost. If we could identify reliable quality checking software to plug into PANDAS, then the cost savings could be worthwhile.
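To put the 86 percent figure in dollar terms, it can be applied to the per-instance cost of the gathering, quality assurance, and archiving driver. The amounts below are derived here from the figures in the article and were not reported by the Library directly.

```python
# Figures quoted in the article (AUD per instance).
GATHERING_QA_ARCHIVING_COST = 71.09
TOTAL_COST_PER_INSTANCE = 178.68
QA_FRACTION = 0.86  # quality assurance share of the driver's cost

# Derived: approximate quality assurance cost per instance.
qa_cost = round(QA_FRACTION * GATHERING_QA_ARCHIVING_COST, 2)

# Derived: quality assurance as a percentage of the whole unit cost.
qa_share_of_total = round(
    100 * QA_FRACTION * GATHERING_QA_ARCHIVING_COST / TOTAL_COST_PER_INSTANCE, 1
)

if __name__ == "__main__":
    print(f"QA cost per instance: AUD${qa_cost:.2f}")
    print(f"QA share of unit cost: {qa_share_of_total}%")
```

On this reading, quality assurance alone accounts for roughly AUD$61 per instance, or about a third of the unit cost, which is the upper bound on what checking software could save.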

The Library is also lobbying the Australian government for the extension of legal deposit to online publications in order to obviate the necessity of obtaining permission from publishers to copy titles into the Archive. Staff time will be reduced after this comes to fruition.

Technology will improve and will enable us to reduce costs. The International Internet Preservation Consortium [10], in which the National Library is an active participant, is working on a suite of Web archiving tools, including a harvester designed specifically by and for national libraries. This is expected to make our work more efficient.

Conclusion

The costing calculation confirmed what we already knew: Web archiving is a costly business when compared to the acquisition of printed materials, and the per-unit cost of selective archiving is currently particularly high because of its labour-intensive nature. This study has also given us useful additional information against which to evaluate our program. Are we really wedded to the policy of cataloguing each title in the Archive? Yes, but perhaps we can find less labour-intensive ways of creating the record. How committed are we to the very expensive quality assurance process? Very committed, but installing suitable checking software will be a high priority when it can be located.

Increasing sophistication of Web archiving technology will increase the efficiency of all Web archiving programs, including those using the currently labour-intensive selective approach, and lower unit costs over time.

Acknowledgement

I would like to acknowledge the major contribution to this costing made by Nizam Yoosuf, Manager, Finance Branch, National Library of Australia. He designed the costing methodology and assisted with the interpretation of the results.

Notes

1. Library and Archives Canada. Electronic Collection: a Virtual Collection of Monographs and Periodicals. http://www.collectionscanada.ca/electroniccollection/ (accessed 22 March 2005); National Diet Library of Japan. Web Archiving Project (WARP). http://warp.ndl.go.jp/ (accessed 22 March 2005); National Library of Australia. PANDORA, Australia’s Web Archive. http://pandora.nla.gov.au/index.html (accessed 22 March 2005).

2. National Library of Sweden. Kulturarw3. http://www.kb.se/kw3/ (accessed 22 March 2005).

3. Library of Congress. MINERVA. http://www.loc.gov/minerva/ (accessed 22 March 2005).

4. State Library of Tasmania. STORS: Long Term Storage of Tasmanian Electronic Documents. http://www.stors.tas.gov.au/ (accessed 22 March 2005); Oltmans, E. and H. van Wijngaarden. 2004. Digital preservation in practice: the e-Depot at the Koninklijke Bibliotheek. Vine 34 (1): 21-26.

5. Information about PANDORA, Australia’s Web Archive, and access to its contents are available at http://pandora.nla.gov.au/index.html.

6. Information about PANDAS is available at http://pandora.nla.gov.au/pandas.html.

7. National Library of Australia. Kinetica. http://www.nla.gov.au/kinetica/ (accessed 22 March 2005).

8. Levels and pay scales are explained in Attachment A – Salary Table of the National Library of Australia Certified Agreement 2004-2007 available here.

9. An “instance” is a single gathering of a title. It includes the gathering of a monograph that has been archived once only, the first gathering of a serial or integrating title (for example, a website that changes over time), and all subsequent gatherings.

10. International Internet Preservation Consortium. http://netpreserve.org/about/index.php (accessed 22 March 2005).


Feature Article 2

Bringing the Digital Revolution to Medieval Musicology: The Digital Image Archive of Medieval Music (DIAMM)

Authors: Julia Craig-McFeely - Royal Holloway, University of London (julia.craig-mcfeely@music.ox.ac.uk), Marilyn Deegan - King’s College London (marilyn.deegan@kcl.ac.uk)

Introduction

The Digital Image Archive of Medieval Music (DIAMM) has been in existence since 1998, with funding from the UK’s British Academy and Arts and Humanities Research Board and The Andrew W Mellon Foundation. Originally a collaboration between Oxford University and Royal Holloway, University of London, the partnership now includes King’s College London. DIAMM’s aims and objectives have evolved from the initial goal of establishing a preservation-quality digital archive for at-risk sources of medieval music (currently holding some 7000 images) to a full online delivery system for images and metadata with tools for scholarly manipulation and annotation. What has never wavered over these years is the total commitment to the highest resolution images, excellent working practices, and rigorous quality control. This article describes the project from its inception to its current phase, charting some of the issues, problems, and successes.

DIAMM: the background

DIAMM came about as the result of a fortuitous collision between a 35-year-old facsimile publishing series and the rapid advances in high-resolution digital camera technology, along with that technology’s accessibility to academic users. For the first time it was possible to obtain a digital scanning back, mounted on a conventional camera body, as well as software that would allow post-processing and image manipulation, together with computers that could handle 250 MB image files without growing old in the process. Admittedly, digital printing at this sort of quality was in its infancy in 1998, but high-end print producers were already embracing digital masters as the way forward.

The Early English Church Music series had been producing facsimiles and editions of English music predating those written in modern notation for a considerable time. The volume of 15th-century fragments was limited by the fact that the fragments were in such poor condition that almost no publication could do them justice, nor could they be of any benefit to those who wished to use the facsimiles for research. These fragments had survived simply by dint of being reused for other purposes, for instance as binding reinforcements for non-musical sources, but had been damaged by water, dirt, rats, glue, or the effects of being stuck down to a wooden board for 600 years or more.

The corpus was reasonably large thanks to discoveries made over the last 50 years by musicologists, scholars, and librarians, who began to take notice of the parchment endpapers in their medieval and later books. Time and condition, however, had ensured that these fragments continued to deteriorate and were not receiving the scholarly attention they deserved, except from a tiny number of scholars who had purchased microfilms, slides, or black and white glossy prints of them. Digital imaging and its associated software offered opportunities to rescue some of these manuscripts from obscurity, and the (at that time) newly burgeoning medium of the Internet offered a way of reunifying objects that had become extremely widely scattered over the course of 600-900 years. Leaves from one particular manuscript from the UK that had been broken up and used for binding reinforcement were dispersed as far afield as Estonia and Canberra in Australia. Reunifying these disparate leaves with leaves of other manuscripts from a similar period had the potential to offer scholars access to a repertory that had hitherto been extremely difficult to study, simply due to its geographical diaspora. Added to that, the potential to improve legibility using digital restoration offered an unprecedented and exciting challenge to the directors of DIAMM, Dr. Margaret Bent (Oxford) and Professor Andrew Wathey (Royal Holloway, University of London).

Image capture

When DIAMM began in 1998, few libraries had the resources to buy equipment of the required quality, so the project has, from its inception, concentrated on developing a mobile strategy that could deliver consistent end results. The scanning “studio” is entirely portable, including lights, stands, camera, scanning back, and computers. We can walk into a library anywhere in the world, set up the equipment, and be ready to take a picture within 45 minutes to one hour of arriving. And we can guarantee that the quality of the image will match that of any images taken by our photographers anywhere in the world.

The project started out with a 49 megapixel PhaseOne scanning back and, thanks to upgrades funded by The Mellon Foundation, is now using a PhaseOne PowerPhase FX, which has a maximum scan area of 144 megapixels. One of the reasons for using the PhaseOne hardware is that it is the most suitable for a “portable” environment. The scanning back captures 24- or 48-bit RGB images, which are fed directly to a computer via a FireWire connection. This enables us to scan moderate-sized manuscripts at 800-1200 dpi at real size and larger manuscripts at around 400-800 dpi. This is perhaps not very impressive next to the resolutions of flatbed scanners, but the value and age of these books mean that laying them on a scanner is out of the question, even if they could be opened fully enough to do so.
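The relationship between pixel count, bit depth, file size, and resolution can be sketched with simple arithmetic. The pixel data of an uncompressed image occupies pixels × bits per pixel ÷ 8 bytes, and a sensor with a given number of pixels along one side covers that number divided by the dpi, in inches. The square 12,000 × 12,000 pixel layout below is a hypothetical assumption made only to illustrate a 144-megapixel scan; the article does not give the scanning back's pixel dimensions.

```python
def uncompressed_size_mb(megapixels, bits_per_pixel):
    """Approximate pixel-data size in megabytes (10**6 bytes).

    Ignores the TIFF header and any embedded metadata.
    """
    return megapixels * bits_per_pixel / 8


def coverage_inches(pixels_along_side, dpi):
    """Length of original covered by one side of the scan at a given dpi."""
    return pixels_along_side / dpi


# Sizes for a full 144-megapixel scan at the two bit depths mentioned.
size_24_bit = uncompressed_size_mb(144, 24)
size_48_bit = uncompressed_size_mb(144, 48)

# Hypothetical: a square 144-megapixel scan, 12,000 pixels per side.
ASSUMED_SIDE_PIXELS = 12_000

if __name__ == "__main__":
    print(f"24-bit: {size_24_bit:.0f} MB, 48-bit: {size_48_bit:.0f} MB")
    for dpi in (400, 800, 1200):
        side = coverage_inches(ASSUMED_SIDE_PIXELS, dpi)
        print(f"{dpi} dpi covers {side:.1f} inches per side")
```

Under that assumption, halving the resolution doubles the side length that fits in one scan, which is why larger manuscripts are captured at lower dpi.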

We have also concentrated on establishing a base-line methodology that ensures a completely consistent imaging process that meets rigorous quality standards. Every image includes industry-standard colour, gray, and size scales, and is scanned at actual size. The colour profile of the scanning back is embedded at capture, and the files are saved in uncompressed TIFF format with all the capture metadata stored in the TIFF header at the point of capture. The metadata is extensive and includes information such as the distance of the lighting heads from the manuscript and height of the camera above it, as well as manuscript and library details. Any information may be relevant to future users of the archive. The colour control information enables evaluation of the colour accuracy of the image by our external quality assurance process [1], and also gives anyone viewing the images on screen additional information so that they can adjust their hardware to give the most accurate rendering of the image. DIAMM has also pioneered the use of ultra-violet (UV) light imaging with a digital sensor. Although some libraries have experimented with digital UV, only DIAMM, with input from the scholars who are using the images, has evolved a workable strategy for UV imaging with a digital sensor.

A number of libraries are now producing their own scanned work, but surprisingly few seem able to meet the baseline standards set by DIAMM, for example, by failing to employ any form of colour calibration in their setup, post-processing images to colour-correct by eye on uncalibrated hardware, sharpening images that were captured out of focus, and so on. The biggest problem for any digital imaging project at the moment is ignorance of the steps necessary to create an “archive” quality image – that is, a master image that requires no post-processing.

Relations with libraries, rights, and licensing

Since the DIAMM collection does not belong to DIAMM, but to the various contributing libraries who license the project to create the images and keep copies in the archive, a complex rights structure was necessary. In 1998 most libraries did not have their own website, and many did not have Internet access of any description. Online delivery of their collection was viewed with a good deal of suspicion, so a considerable leap of faith was required to allow a third party to provide online access. Almost every library, archive, and record office in the UK visited during the first two years of the project agreed to allow DIAMM to make their materials available via the Internet—a testament to their commitment to facilitating public access to their documents—albeit behind a careful protective mechanism. The unprecedented step of libraries allowing their collections to be available to an online community is something for which DIAMM is extremely grateful and without which DIAMM would not have grown as it has into an unusual and constantly evolving online resource.

Time and rapidly developing technology have forced DIAMM to rethink all its structures and delivery methods and mechanisms. Not only that, the rapid growth of the collection has raised management issues that could not have been anticipated in the early days of the project. Originally around 700 images had been planned for the complete collection. The archive now contains over 7000 images (not all of which are available online) and an additional 6500 images promised to us from a digitization project in Germany to which DIAMM contributed in an advisory capacity. The archive continues to grow as libraries contribute their own images and we continue to negotiate to photograph sources that would otherwise be difficult or impossible to access.

From a UK-based project, DIAMM has extended its work and consulted on projects across Europe and worked as far afield as Japan. Recently a manuscript was flown from the USA to Oxford in order to enable DIAMM to create digital images of archiving and restoration quality. In Europe we have relied heavily on the negotiating abilities of local scholars to persuade libraries to allow us to photograph their manuscripts, particularly in Italy where there is a nightmarish suspicion of anything that is offered free or that requires any form of contract. However, the contracts drawn up by DIAMM and refined over the years have had a major influence in convincing many libraries of not only our concern for their rights, but also our intention to protect the images and make no commercial use of them.

Different countries have surprisingly different attitudes to our approaches. In Belgium and Switzerland we have been welcomed with enthusiasm, and libraries have offered their images for online access as well as for the archive. This stands in stark contrast to Italy, where sometimes three years of complex and anxious negotiation have been necessary before the library will consent to allow us access, even with a contract drawn up by their own legal department. The support of the Institute of Historical Research (IRHT) in France opened doors in libraries and archives across that country, and support from French scholars has also been considerable. In some libraries in France, DIAMM images are given to scholars rather than allowing them to handle the manuscripts they wish to study.

Digital restoration

The key to digital restoration is to have an extremely high-quality image at the outset. Scans of surrogates are largely useless for the purposes of digital restoration and cannot be enlarged on screen to the degree that high-resolution digital images can be blown up. Enlargement alone can reveal details of scribal method and obscured material that are not visible to the naked eye, without any necessity for post-processing intervention.

Detail of a high-resolution scan made by DIAMM.

Because we had not attempted any form of digital restoration before embarking on the project, the extent to which DIAMM’s restoration processes could enhance the manuscript images was largely speculative (coloured by a slightly “science-fiction” idea of what might be done), and it wasn't until we were able to obtain images of sufficient quality to allow us to embark on digital restoration that we began to realize just how exciting this project might be. We suspected in the early stages that what could actually be done would not be as revealing as we had hoped. The earliest restoration undertaken by the project manager, however, proved that we could go beyond our expectations in repairing some of the damage that had been done to these sources.


GB-Oxford, Corpus Christi College MS 144, Palimpsest, before and after restoration.

These early attempts may have given us a more optimistic view of what could be done than reality proved, since some sources remained resistant to our manipulation, despite appearing to be good subjects for restoration. Nevertheless, others that did not appear hopeful at first yielded staggering results in light of our expectations.


GB-Stratford, Shakespeare Birthplace Trust, DR 37 vol.41, outer wrapper, before and after restoration. Originally it was thought there was no music on this leaf; the piece has now been identified.

If only 20% of the sources were to improve under digital manipulation, the quantity and quality of materials available for study would be dramatically improved. In fact, of the 40% of sources requiring some level of restoration, some 38% are now fully readable again, if not particularly beautiful, and the remaining 2% are significantly improved.

Delivery

Since the use of the Internet as a research resource was in its infancy in 1998, we did not anticipate anything more than building an archive to which scholars could gain local access. Once we initiated image collection for the archive, we established a website to provide project information. Within months of going live, demand from scholars interested in our sources showed that we really needed to think about delivering the images online – something that now, eight years down the line, seems absolutely fundamental.

However, at the time, delivering images on the Internet was a slow and unwieldy process. Internet access, outside university or business Ethernet networks, was by modem and many scholars would have to access these images using a slow connection.

Internet technology (browsers, delivery methods, JavaScript, etc.) was not geared towards delivering exceptionally large images to a remote user. The project was therefore faced with finding a method of delivering these images that would allow scholars the most flexibility in examination whilst also allowing for slow download times. Our first attempts at online delivery used the PDF format, since it offered zoom, rotate, and security features that would not require most users to download third-party software in order to see the images. Even during this period it was surprising how many scholars took up the offer of accessing these manuscripts and were prepared to wait up to 6 minutes for an image to arrive in a form and size suitable for detailed examination.

Any project, particularly a digital one, that runs for more than a few years, must evolve as rapidly as the medium in which it works. DIAMM has had to change with the times and has looked to our existing funding bodies (the Arts and Humanities Research Board) and newer ones (notably The Mellon Foundation) to provide us with the resources to grow and advice on directions to take. The original “flat” informational website set up in the early years has now become a dynamic database-driven resource engineered and maintained by the Centre for Computing in the Humanities at King's College London.2

The delivery system we now employ is fast and efficient and allows even those using a dialup Internet connection to make use of images at a very high resolution without significant download delays. Our early efforts at delivering online were directed towards an academic user-base with a specific knowledge and interest in this particular branch of musicology, but the profile of the users has broadened considerably since then and now embraces all disciplines and users ranging from specialist academics to schoolchildren doing a history project, as well as many performers who come to the resource for fresh performing materials.

The original website assumed not only some prior familiarity with the sources, but ready access to the printed catalogues that listed and described them. However, since the archive includes uncatalogued fragments (some discovered since the catalogues were compiled and 11 new sources discovered during the course of the project's work), and since our user-base is far broader than the musicological community, present work on the DIAMM collection concentrates on providing users with detailed intellectual metadata to accompany each image. Since many of the sources we have photographed have concordances in sources not (yet) in our collection, the website includes intellectual metadata for a much broader group of manuscripts than are available online. The project currently includes information about many fragments and manuscripts for which it has no images available, indicating DIAMM’s inherent transformation from a largely image-based (or document-based) resource to an information (or data-based) archive that includes images where possible.

In response to user feedback, the project is developing an image-annotation tool and hopes by the end of the year to place restored images of all difficult-to-read sources alongside their unrestored counterparts.

Negotiations are ongoing with libraries who prefer to not have their images online, encouraging them to join the large number who have made their images available through the DIAMM portal. The concerns of the libraries are foremost in DIAMM's planning and structures. Our collection would not exist if participating libraries were to withdraw their permission, so the demands and requirements of users have to take second place to rights protection and our relationship with our contributors. For this reason, access to the images is limited to password holders. Anyone can obtain a password for no cost by completing and returning the Website Access Agreement, which stipulates respect for the copyright status of the images – particularly important when giving access to users in countries that do not recognize copyright in images.

Without this agreement, many of the images in the online collection would not be available. Despite the tedious necessity of printing, signing, and posting a form, the number of scholars who do go to the trouble to obtain access is testimony to the relevance and interest of the collection. We receive signed access forms almost daily from all over the world, which reassures us that this necessary “barrier” is not significant to those who genuinely wish to access the sources and would prefer not to have the images obscured by a watermark or at a size that is less than useful.

Already, scholarship in Medieval Musicology is changing as a result of the DIAMM resource. Courses in Medieval history, which used to be difficult to teach in music departments, now use the website as a teaching tool and scholars who previously worked only on a tiny corner of the repertory are now broadening their outlook to embrace a much wider corpus of materials. New discoveries, made possible by the quality of the images and the success of digital restoration, are shaping scholarship on materials from this period both within and outside musicology, and the audience for these materials is constantly growing.

Notes

1. External quality assurance is done by Alan Lock, an imaging expert who has helped establish DIAMM standards from the start.

2. All development work at CCH is carried out by Harold Short and the technical team there. Transmission of the database between the working copy and the online version is managed by John Bradley using Perl scripts originally designed by Hafed Walda, and further developed by Gerhard Brey, who is also handling the JavaScript aspects of the site. Other aspects of the delivery front end and development of online tools are the province of Paul Vetch, Paul Spence, and the team of XML experts.


Highlighted Web Site

PrestoSpace



PrestoSpace (www.prestospace.org) is a multidisciplinary, integrated project funded under the European Commission 6th Framework Programme for Research. The project was founded to address the looming impact of physical degradation and technical obsolescence on the cultural, historical, commercial, and scientific memory held within Europe’s diverse audiovisual collections. Current efforts are targeted at creating “preservation factories”: solutions, services, and tools that maximize automation and cost-quality efficiency for preserving audiovisual materials. The main initial focus is on digitization of analog materials: stock selection, digitization processes, restoration, storage, and metadata production.

To accomplish this objective, PrestoSpace project partners are “preparing the business plan, contacting potential investors and working with commercial partners to set up the actual services. These services will exploit the technological and industrial results of the project.” Resulting tools and services should benefit all kinds of audiovisual collections, small and large.

The large consortium pulls together participants from archives, research centers, universities, industry, and international non-profit institutions.

Research and development is proceeding in four major work areas:

  • Preservation Work Area: providing and integrating tools for the preservation process
  • Restoration Work Area: providing and integrating tools for the restoration process
  • Storage and Archive Management Work Area: addressing planning, financial, and management tasks in respect of processing and storage technology
  • Metadata, Access, and Delivery Work Area: ensuring proper delivery to the archives, with access tools and rich content descriptions

Each work area is subdivided into five parallel tracks, called “workpackages,” designed to ensure tight integration and interoperability between the work areas by specifically exploring:

  • User Requirements
  • System Architecture & Specifications
  • Integration
  • Services
  • Exploitation & Tests

The PrestoSpace website features project background information, links to partners, informational links, and project deliverables, including completed reports such as “Final report on users requirements” and “State of the Art of Content Analysis Tools for Video, Audio and Speech.” The site can be viewed in English, French, German, Italian, and Dutch.


FAQ

A Little Bit'll Do You (In): Checksums to the Rescue

Author: Richard Entlich - Cornell University (rge1@cornell.edu)

I’m concerned about the integrity of digital files that come into my collection. What can I do?

Introduction

Digital content is prized over its analog counterpart for its many desirable characteristics. Digital objects are compact, portable, sharable, interactive, and searchable. Much to the chagrin of certain content providers, unless elaborate technical measures are taken, digital content theoretically lends itself to the creation of perfect copies, which themselves can be perfectly recopied, ad infinitum.

But digital technology does have its downsides. Most users and especially content managers are painfully aware that rapidly evolving techniques for encoding and storing digital content lead to obsolescence and that digital storage media deteriorate much faster than the most robust analog media. Obsolescence and media decay lend a certain fragility to digital content that requires it to be carefully monitored and somewhat pampered if it is to survive intact for the long term.

Another characteristic of digital content that contributes to its fragility is susceptibility to corruption. Analog objects can endure significant degradation without total loss of informational content. A coffee stain on a book page may hinder, but probably doesn't destroy its readability. A heavily scratched long-playing record may reproduce sound that is noisy and of poor fidelity, but still presents the essence of its original content.

On the other hand, depending on the nature of the object and the location of the damage, the tiniest defect can render a digital object completely non-functional or scramble its informational content beyond recognizability. A single flipped, missing, or misplaced bit can prevent an executable file from running or throw off the reading frame of a data file. Changes to files may also be quite subtle and therefore not readily apparent to the user, leading to unwarranted confidence in the reliability of the contents.

Thus, in order to be trusted for legal, financial, medical, scholarly, and other purposes, the authenticity of digital objects must be established and then maintained over time by monitoring the object's fixity, that is, by ensuring that any and all changes are purposeful and fully documented. Fortunately, modern hardware for storing, manipulating, and transmitting digital content has intrinsically low error rates, so random errors are not particularly common. However, unanticipated changes can and do occur as a result of malfunction, media degradation, human error, and malicious intent. Therefore, an essential function of any digital object repository is the ability to monitor and detect modifications to the contents.

Checksums: Redundancy put to good use

In the course of everyday computer use, when an important file is lost or damaged, restoration of the original is normally possible because of the maintenance of extra copies or backups. A backup is the ultimate form of redundant storage; an exact and complete copy of the original. Redundancy, albeit in much more compact and less complete forms, can also be used to detect changes in digital objects. The mechanisms to do so are usually called redundancy checks or just checksums (though there are kinds of redundancy checks other than checksums).

The redundant data generated for error detection purposes is generally created by a highly specific mathematical process. The checksum is generated initially and can then be recomputed and compared to the original value to make certain the object hasn't changed. Checksum comparisons can be employed periodically (to ensure that fixity is being maintained over time), or at any time an object is retrieved, transferred to new media, transmitted over a network, or processed in any way that opens the possibility of corruption.

Checksums vary considerably in their sophistication. Simple checksums are mathematically trivial to compute and thus can be generated with minimal computer processing power, but have significant limitations in their ability to detect errors and their resistance to intentional misuse.

For example, an early form of checksum used in telephone modems is the parity bit. A parity bit is an extra bit added to a block of binary data (such as a single ASCII character) computed by counting the number of '1' bits in the original character code. If the parity bit computed after transmission doesn't match the one sent with the data, an error has occurred and the receiving unit will ask for the data to be retransmitted. This primitive form of error detection catches changes to odd numbers of bits, but misses even numbers (e.g., two errors in the same character code).
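The parity scheme described above can be sketched in a few lines of Python. This is only an illustration, assuming an even-parity convention (the parity bit is 1 when the count of '1' bits is odd); the function name `parity_bit` and the sample character are made up for the example.

```python
def parity_bit(value: int) -> int:
    """Even-parity bit: 1 if the number of 1 bits in value is odd, else 0."""
    return bin(value).count("1") % 2

original = ord("A")                  # 0b1000001, which has two 1 bits
sent_parity = parity_bit(original)   # parity bit transmitted with the data

one_flip = original ^ 0b0000100      # simulate one corrupted bit
two_flips = original ^ 0b0000110     # simulate two corrupted bits

# A single flipped bit changes the parity, so the error is noticed.
print("one flip detected:", parity_bit(one_flip) != sent_parity)
# Two flipped bits cancel out, so the error slips through.
print("two flips detected:", parity_bit(two_flips) != sent_parity)
```

The second case shows why parity alone is too weak for fixity checking: any even number of bit errors leaves the parity bit unchanged.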

Another simple form of checksum, the check digit, is used for error detection in decimal data. Some well known numerical identification schemes use check digits, including the UPC (Universal Product Code) found on nearly all packaged merchandise, and the ISBN (International Standard Book Number). Each has its own way of computing the check digit. The current 10-digit ISBN specification (due to be superseded in 2007 by a 13-digit version) uses the final digit as a check digit.

Consider, for example, the ISBN 0-9700225-0-6. To verify that an ISBN is valid, one multiplies each of the first nine digits, starting from the left, by its position number (starting with one) and then computes the sum of the nine products. In this case, the sum is 0×1 + 9×2 + 7×3 + 0×4 + 0×5 + 2×6 + 2×7 + 5×8 + 0×9 = 105. One then divides the sum by 11, which gives 9 with a remainder of 6. It is this remainder that forms the check digit (a remainder of 10 is written as the letter X). In our example, it matches the 10th digit, so the ISBN is valid. The ISBN check digit can detect any change in any single digit.
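The ISBN-10 calculation can be written out as a short Python sketch. The function names are hypothetical, and the handling of a remainder of 10 as "X" follows the ISBN-10 convention rather than anything specific to this example.

```python
def isbn10_check_digit(first_nine: str) -> str:
    """Weight each digit by its position (1..9), sum, and take the remainder mod 11."""
    total = sum(int(d) * pos for pos, d in enumerate(first_nine, start=1))
    remainder = total % 11
    return "X" if remainder == 10 else str(remainder)

def is_valid_isbn10(isbn: str) -> bool:
    """Strip hyphens and spaces, then compare the computed and printed check digits."""
    chars = [c for c in isbn if c not in "- "]
    if len(chars) != 10 or not all(c.isdigit() for c in chars[:9]):
        return False
    return isbn10_check_digit("".join(chars[:9])) == chars[9].upper()

print(is_valid_isbn10("0-9700225-0-6"))  # the ISBN from the text
print(is_valid_isbn10("0-9701225-0-6"))  # same ISBN with one digit altered
```

The first call reproduces the worked example (sum 105, remainder 6); the second changes a single digit, which shifts the weighted sum and so no longer matches the check digit.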

Parity bits and check digits are designed to detect a subset of common errors in very small, predefined blocks of data. But what about large and arbitrarily constructed data such as entire computer files? Checksums can also be used for error detection in large blocks of data. Simplistically, one could add up the numbers corresponding to the codes for every byte in a file and reduce the file to a single number. Such a scheme is far too weak to be used to assure fixity, since many files could potentially generate the same checksum and with a bit of effort, a file could be intentionally modified without changing its checksum.
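To make the weakness concrete, here is a minimal sketch of such an additive checksum. The reduction modulo 256 and the sample byte strings are assumptions chosen for illustration only.

```python
def additive_checksum(data: bytes) -> int:
    """Sum every byte value and keep only the low 8 bits (an assumed reduction)."""
    return sum(data) % 256

original = b"pay 100 to Bob"
tampered = b"pay 001 to Bob"   # same bytes, two of them swapped

# Addition ignores byte order, so rearranged content gives the same checksum.
print(additive_checksum(original) == additive_checksum(tampered))
print(original == tampered)
```

Because addition does not depend on position, any reordering of the same bytes goes undetected, which is one reason this scheme cannot assure fixity.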

Stronger checksums employ more sophisticated mathematical techniques in their calculations. A popular kind of checksum called a CRC (cyclic redundancy check) uses polynomial arithmetic and takes both the value and the position of each byte of data into account, thus greatly reducing the likelihood that a random error will produce a file with the same checksum. However, even CRCs can be manipulated to allow a file to be intentionally changed without changing its CRC value.
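As a small illustration of position sensitivity, the sketch below uses the CRC-32 implementation in Python's standard `zlib` module. CRC-32 is only one of several CRC variants, and the sample byte strings are made up for the example.

```python
import zlib

original = b"pay 100 to Bob"
tampered = b"pay 001 to Bob"   # same bytes, two of them swapped

# A plain byte sum cannot tell these two inputs apart...
print(sum(original) == sum(tampered))
# ...but CRC-32 depends on where each byte sits, so the values differ.
print(zlib.crc32(original) == zlib.crc32(tampered))
print(format(zlib.crc32(original), "08x"), format(zlib.crc32(tampered), "08x"))
```

This shows resistance to accidental rearrangement only. As noted above, a CRC offers no protection against someone deliberately constructing a file with a matching value.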

The more advanced checksums employ a mathematical construct called a hash function that takes an arbitrarily long input file and produces a short (usually between 128 and 512 bits), uniform length string of characters that has a very high probability of being unique to that file. The strings generated by these algorithms are sometimes also called message digests or digital fingerprints, suggesting that they uniquely identify a file. The most sophisticated algorithms use hash functions based on cryptographic techniques. Two well-known examples are MD5 (Message Digest 5) and SHA-1 (Secure Hash Algorithm 1).
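The fixed-length property is easy to observe with Python's standard `hashlib` module. This is a minimal sketch; the sample inputs are arbitrary.

```python
import hashlib

samples = [b"", b"a", b"a" * 1_000_000]
for data in samples:
    md5_hex = hashlib.md5(data).hexdigest()
    sha1_hex = hashlib.sha1(data).hexdigest()
    # Each hexadecimal character encodes 4 bits: MD5 gives 128 bits, SHA-1 gives 160.
    print(len(data), "bytes ->", len(md5_hex) * 4, "bit MD5,", len(sha1_hex) * 4, "bit SHA-1")

# Changing a single character produces a completely different digest.
print(hashlib.md5(b"fixity").hexdigest())
print(hashlib.md5(b"fixitz").hexdigest())
```

Whether the input is empty or a million bytes long, the digest length stays the same, which is what makes these values convenient to store as metadata.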

These modern algorithms are designed to make collisions (two different files producing the same checksum) extremely unlikely and to be highly resistant to the construction of files that produce checksums identical to those produced by other known files. Both have gone through version upgrades in response to weaknesses found in their cryptographic aspects (e.g., SHA-1 replaced SHA-0 and MD5 replaced MD4). In fact, SHA was developed by the US National Security Agency after flaws were found in MD5.

Recently, cryptographers have reported that even the SHA-1 algorithm has some weaknesses. What these flaws have typically meant is that, with a huge investment in computing resources (typically requiring a supercomputer), it has been possible to produce two files that produce the same checksum. These weaknesses are of great interest to cryptographers and institutions that work with top secret data, but they should not dissuade digital repository managers from using the algorithms for routine detection of unanticipated and unauthorized changes in files.

Practical considerations

Though there are many different types of checksum algorithms, only a few are commonly used for fixity determination. MD5 is probably the most widely employed. It is unambiguous (CRC can refer to a number of different, incompatible algorithms), and md5sum, a program for generating MD5 checksums, is standard on most Unix and Unix-like systems, including Linux; Mac OS X ships a similar utility named md5. Md5sum is also available as part of the GNU core utilities. There are a number of freeware MD5 generators for Windows, going under names such as MD5 and md5summer. Checksum generators are also included as a function in other repository management software. For example, JHOVE (JSTOR/Harvard Object Validation Environment) is capable of generating CRC32, MD5, and SHA-1 checksums.
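Where none of these tools is at hand, an MD5 checksum can also be computed with a few lines of Python. The sketch below is illustrative only: the function name `file_md5`, the chunk size, and the temporary demo file are assumptions, and the hexadecimal value it prints is the same kind of 32-character string that md5sum reports.

```python
import hashlib
import os
import tempfile

def file_md5(path, chunk_size=1024 * 1024):
    """Read the file in chunks so that large files need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Demo on a throwaway file; in practice the path would point at a repository object.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello world\n")
    demo_path = tmp.name

demo_checksum = file_md5(demo_path)
os.remove(demo_path)
print(demo_checksum)
```

Reading in chunks keeps memory use constant, which matters for the large image and audiovisual files typical of digital repositories.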

Checksums have utility throughout the chain of creation and custody of digital objects. For example, a checksum can be generated when an analog object is newly digitized and used to confirm fixity throughout processing. Digitization work that is outsourced should specify that the vendor create a checksum so the receiving institution can verify the fixity of the files upon receipt. Obviously, any time the file is intentionally changed, for example to correct problems found during quality control, a new checksum will have to be generated.

Thus, like other metadata that describes and characterizes a digital file, checksums are stored separately but remain associated with particular digital objects. In addition to the checksum itself, the metadata should identify the specific type and version of checksum used (unless only one type is used throughout) as well as the date and time it was last computed. Such metadata should be afforded at least as much security (access controls, regular backup, etc.) as the digital objects themselves. Obviously if a digital object is intentionally altered and the perpetrator also has the ability to modify the checksum data, checksum verification will be ineffective in detecting the change.
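One possible shape for such a metadata record is sketched below in Python. The field names, the function names `make_fixity_record` and `verify_fixity`, and the use of a plain dictionary are all assumptions for illustration; a real repository would follow its own metadata schema and store the record separately from the object.

```python
import hashlib
import os
import tempfile
from datetime import datetime, timezone

def make_fixity_record(path, algorithm="md5"):
    """Compute a checksum and record which algorithm was used and when."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return {
        "path": str(path),
        "algorithm": algorithm,
        "checksum": digest.hexdigest(),
        "computed_at": datetime.now(timezone.utc).isoformat(),
    }

def verify_fixity(record):
    """Recompute with the recorded algorithm and compare with the stored value."""
    current = make_fixity_record(record["path"], record["algorithm"])
    return current["checksum"] == record["checksum"]

# Demo: record a checksum, verify it, then simulate an unexpected change.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"master image bytes")
    demo_path = tmp.name

record = make_fixity_record(demo_path)
ok_before = verify_fixity(record)
with open(demo_path, "ab") as f:
    f.write(b"\x00")               # a single stray byte appended
ok_after = verify_fixity(record)
os.remove(demo_path)
print(ok_before, ok_after)
```

Storing the algorithm name alongside the value lets the repository change algorithms later without making older records ambiguous.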

There are some very fast algorithms now available for computing even the most sophisticated cryptographically-based checksums. However, the process does require reading through the entire file, so if checksums are to be mass-generated on a large number of files or on very large files, the time and computing power required should not be underestimated. The best way to determine whether the load is tolerable is to run some tests on representative files, since the actual time required will depend on the specific software used, the CPU speed, storage and network architecture, average file size, and the nature of competing demands on the computing infrastructure.
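A rough first test of this kind might look like the following Python sketch. It measures only the hashing speed of the CPU on in-memory data, so it gives an upper bound; disk and network reads, which often dominate in practice, are deliberately left out. The function name and the 16 MB sample size are arbitrary choices.

```python
import hashlib
import time

def throughput_mb_per_s(algorithm, size_mb=16):
    """Hash size_mb megabytes of in-memory data and return megabytes per second."""
    block = b"\0" * (1024 * 1024)
    digest = hashlib.new(algorithm)
    start = time.perf_counter()
    for _ in range(size_mb):
        digest.update(block)
    elapsed = time.perf_counter() - start
    return size_mb / elapsed

for name in ("md5", "sha1"):
    print(name, round(throughput_mb_per_s(name)), "MB/s")
```

Dividing the total size of a collection by the measured rate gives a first estimate of how long a full checksum pass would take, to be refined by timing runs against real files on the actual storage.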

Caveats and conclusions

It is important not to overestimate what the use of checksums can accomplish. In particular, confidence in the authenticity of a file requires an uninterrupted chain of custody. This means, for example, that a file downloaded from the Internet that is provided with a checksum that can be verified after download may nevertheless be something other than what it claims. Even if the checksums match, all that proves is that the download wasn't corrupted. Without knowing the provenance of the file, you can't assume it is what it claims to be.

Checksums have other limitations. They may be too specialized a tool for certain applications. For instance, they can't be used to determine whether two word processing files in different formats contain the same document, or whether two audio files contain the same song recorded at different volume levels. Work using concepts similar to those in checksum generation is being done to address the problem of content identification.

Appropriate use of high quality checksums can improve confidence in the integrity of stored digital objects and allow monitoring and detection of unauthorized changes to files from many causes. Due to their fragility and susceptibility to modification, digital objects can only be trusted if their fixity is routinely confirmed, and checksums are an essential tool for meeting that need.


Calendar of Events





Digital Preservation Management: Short-Term Solutions to Long-Term Problems
July 17–22, 2005
Ithaca, New York

There are a few remaining slots available for Cornell University Library’s summer offering of its digital preservation management workshop.

DCC and DPC Joint Workshop on Digital Curation Cost Models
July 26, 2005
London, UK

This one-day workshop will be held at the British Library and features an international slate of speakers addressing usable and practical cost models for sustainable digital preservation programs. Presentations will include case studies and examples of current cost modeling activities.

Digital Libraries à la Carte: Choices for the Future
August 21-26, 2005
Tilburg, The Netherlands

The International Ticer School will offer this modular-style course for librarians and publishers. Participants may choose to attend up to five of the following one day modules: 1. Trends and strategic issues for libraries, 2. Technological developments: Threats and opportunities for libraries, 3. Library consortia and licensing, 4. Open access and institutional repositories, and 5. Libraries and teaching and learning.

DRH 2005: Digital Resources for the Humanities
September 4-8, 2005
Lancaster, United Kingdom

The tenth Digital Resources for the Humanities conference will focus on the critical evaluation of the use of digital resources in the arts and humanities and highlight methodologies used to study technologies in scholarly research in these disciplines.

European Conference on Digital Libraries
September 18-23, 2005
Vienna, Austria

The preliminary program for workshops and tutorials of the ninth annual European Conference on Research and Advanced Technologies for Digital Libraries has been posted on the conference website. In addition, the International Web Archiving Workshop and Digital Preservation will be held in conjunction with this conference September 22-23, 2005.

Building the Info Grid: Digital Library Technologies and Services
September 26-27, 2005
Copenhagen, Denmark

The seminar will feature speakers from prominent institutions and projects addressing the themes of global, national, and local collaboration and/or competition; service oriented architectures; and identity and rights management. A related event, the European Fedora User Meeting will be held on September 28, 2005 in Copenhagen.

Digital Futures: From Digitisation to Delivery
September 26-30, 2005
London, UK

Co-sponsored by King's College London and OCLC-PICA, this limited enrollment 5-day training event will center on the creation, delivery, and preservation of digital resources. Special events include visits to the UK’s National Gallery and British Library to tour their digitization related activities.

DIGITS FUGIT! Preserving Knowledge into the Future
November 3-5, 2005
Boston, Massachusetts

Online registration and a preliminary program are now available for the Museum Computer Network (MCN) Annual Conference. The meeting will feature workshops, panel sessions, tours, and a vendor exhibition hall. This year’s keynote speaker will be Alexander Rose, Executive Director of the Long Now Foundation.


Announcements





JHOVE Production Version 1.0 Released

The JHOVE software package, JSTOR/Harvard Object Validation Environment, has been released in its first production version. JHOVE provides functions to identify the format of a digital object and to determine how well the object conforms to the format specification. JHOVE’s earliest release was spotlighted in the October 2003 Highlighted Web Site.

PDF/A Approved

The PDF Archival standard, Document Management—Electronic Document File Format for Long-term Preservation—Part 1: Use of PDF 1.4 (PDF/A-1) has been approved as an International Standard by the International Organization for Standardization.

Museums and the Web 2005: Best of the Web Awards

The Museums and the Web 2005 Best of the Web Competition committee has announced its winners. The annual competition is held in association with the Archives and Informatics conference. This year's winners span a variety of categories, including On-line Exhibition, E-Services, Innovative or Experimental Application, Museum Professional's Site, and Research Site. The winner for Best Overall Museum Web Site was The Science Museum’s (London, UK) Making the Modern World Online - Stories about the lives we've made.

The Long-term Preservation of Authentic Electronic Records: Findings of the InterPARES Project

The report of the first phase of the InterPARES (International Research on Permanent Authentic Records in Electronic Systems) project has been released. The report “focuses on the preservation of the authenticity of records created and/or maintained in databases and document management systems in the course of administrative activities.”

UK Web Archive is Now Online

The UK Web Archiving Consortium has made its first set of archived websites available online. The sites in the searchable archive were selected for their scholarly, cultural, and scientific value. This archive is “aimed at the broad research community and marks the first systematic attempt to create an archive of social, historic and culturally significant web-based material from the UK domain.”

Digital Archiving and Long-Term Preservation Program (DIGARCH) Awards

The Library of Congress National Digital Information Infrastructure and Preservation Program (NDIIPP) and the National Science Foundation (NSF) have awarded eleven university teams a total of $3 million for research to support the long-term management of digital information. The awards were issued through the Digital Archiving and Long-Term Preservation program (DIGARCH), a joint initiative between NDIIPP and NSF to develop the first digital-preservation research grants program.


 RLG News

PREMIS Final Report Released; Certification of Digital Archives Project Begun



Final Report of the PREMIS Working Group is Released
The joint OCLC-RLG PREMIS (PREservation Metadata Implementation Strategies) working group recently released its final products. The products address the project's objectives to:

  • Develop a core preservation metadata set, supported by a data dictionary, with broad applicability across the digital preservation community.
  • Identify and evaluate alternative strategies for encoding, storing, and managing preservation metadata in digital preservation systems.

The 237-page publication, Data Dictionary for Preservation Metadata: Final Report of the PREMIS Working Group (May 2005), is available from the PREMIS working group web site. The document includes the PREMIS Working Group Final Report, the Data Dictionary (version 1.0), and examples illustrating the use of the PREMIS data dictionary for several types of digital objects and preservation contexts. These sections are also available as separate documents.

Completing the work of the group, five supporting XML schemas have been created to allow for implementation of the core metadata element set. These are available from the new PREMIS Maintenance Activity web site hosted by the Library of Congress. They will be maintained in the Network Development and MARC Standards Office of the Library of Congress.

RLG to work with CRL on Certification of Digital Archives Project
The Center for Research Libraries, based in Chicago, has invited Robin Dale, RLG's digital preservation expert, to be the director of a project that will test ways to audit and certify digital archives. Creating auditing procedures for digital repositories will help ensure the future integrity and accessibility of electronic journals and other scholarly materials.

The 18-month Certification of Digital Archives Project is funded by a $433,000 grant from The Andrew W. Mellon Foundation. It builds on the work of the Digital Repository Certification Task Force, led by RLG and the National Archives and Records Administration (NARA). The CRL project will test and refine the metrics and instruments developed by the RLG-NARA task force by auditing actual repositories, including the Portico archive of e-journals maintained by Ithaka Harbors, Inc. and the archive of Elsevier journals maintained by the Koninklijke Bibliotheek.

Bernard Reilly, CRL president, is the principal investigator for the project. Robin Dale will continue to serve as program officer in RLG's Mountain View office during the CRL project, devoting a quarter of her time to RLG projects and activities.

For more information, see the CRL press release or contact Robin Dale.


 Publishing Information





RLG DigiNews (ISSN 1093-5371) is a Web-based newsletter conceived by the RLG preservation community and developed to serve a broad readership around the world. It is produced by staff in the Department of Research, Cornell University Library, in consultation with RLG and is published six times a year at www.rlg.org.

Materials in RLG DigiNews are subject to copyright and other proprietary rights. Permission is hereby given to use material found here for research purposes or private study. When citing RLG DigiNews, include the article title and author referenced plus "RLG DigiNews." Any uses other than for research or private study require written permission from RLG and/or the author of the article. To receive this, and prior to using RLG DigiNews contents in any presentations or materials you share with others, please contact Jennifer Hartzell (jlh@notes.rlg.org), RLG Corporate Communications.

Please send comments and questions about this or other issues to the RLG DigiNews editors.

Co-Editors: Anne R. Kenney and Nancy Y. McGovern; Associate Editor: Robin Dale (RLG); FAQ Editor: Richard Entlich; Contributor & Copy Editor: Ellie Buckley; Production: Jenn Colt-Demaree, Carla DeMello; Advisor: Peter Hirtle.


All links in this issue were confirmed accurate as of June 15, 2005.




© 2006 RLG