April 15, 2003, Volume 7, Number 2 | ISSN 1093-5371
The Paradigma Project

Carol van Nuys

Growth of Web Archiving in Europe

Digital documents of all kinds are disappearing daily, and with them the opportunity for new generations of readers to study and enjoy today's documents in the future. The preservation of our digital cultural heritage is an increasingly important and challenging issue. In response to the situation, about fifteen European countries have started some type of Web archiving activity.[1] Different countries have chosen different collection strategies: Denmark and Australia have taken the selective approach; Sweden, Iceland, and Finland have harvested their entire national Web spaces; and the National Library of the Netherlands has made an agreement with the Dutch Publishers' Association (NUV)[2] for the deposit of electronic publications offline and online. Only five of the countries that are involved in Web archiving can base their work on legal deposit legislation, and Norway is one of them.[3]

Background on Legal Deposit in Norway

Legal deposit has a long tradition in Norway. The first Legal Deposit Act for Denmark/Norway was passed in 1697, and censorship undoubtedly played an important role in its establishment. The law remained in force until the Union with Denmark was dissolved in 1814. A royal decree on legal deposit was passed in 1815, followed by a new Legal Deposit Act in 1882. This again was succeeded by the Legal Deposit Act of 9 June 1939. The common denominator of all these acts was that they included printed material only. However, as new media developed, the need to pass a new and extended law became more and more evident. The present Legal Deposit Act was thus passed on 9 June 1989, and, of course, the main intent of this law was no longer censorship, but cultural preservation.

The National Library of Norway's current Web archiving work is strongly influenced by the Norwegian Legal Deposit Act. The purpose of this act is to
Considered extremely modern when it was passed in 1989, the act covers all generally available Norwegian documents stored in any medium, including paper, microforms, photographs, combined documents, sound recordings, films, video, electronic publications, and broadcast programs. It also covers documents published abroad for Norwegian publishers and those specially adapted for a Norwegian public. The act does not cover documents found in closed networks, computer software, documents accessible only through a company or organization's intranet, net communications (i.e., e-mail or closed discussion and chat groups of a private nature), archival material covered by other legislation, or official governmental publications. Chapter 9 of the act's regulations (§ 30, second subsection) states:
We can easily see that the act and its regulations were written before the World Wide Web arrived. Fulfilling the requirement for two copies of each generally available Norwegian Web document is simply impossible. Today, the National Library is investigating the most effective ways to fulfill the intent of the act as applied to digital documents and is considering the possibility of using a combination of different collection approaches.
Overview of the Paradigma Project

The Paradigma Project[4] began in August 2001. Its goals are to develop and establish routines for the selection, collection, description, identification, and storage of all types of digital documents and to give users access to these publications in compliance with the Legal Deposit Act. The project is scheduled to end on December 31, 2004. Paradigma's activities fall within the bibliographic, technical, and legal areas, as reflected in its eight work packages:
Currently the project continues the National Library's earlier work in several of these areas. At present, activities from several of the work packages are under way or completed. The following sections highlight the work connected to the legal deposit of Web materials.

Aspects of the Collection Strategy

Selection Criteria
There are several reasons for taking this general harvesting approach. First, we cannot predict which documents will be of value in future research and documentation. Second, digital storage is becoming cheaper every day. Third, unfiltered harvesting saves us from resource-consuming manual selection at harvesting time. Finally, a Web Archive user can find documents via free-text search functions, thus being able to review all documents, including those that do not qualify for manual cataloging. Selection criteria for any use, such as further bibliographic description, can be challenged and changed at any time. This would, of course, be impossible if the material were excluded at harvesting time.

Total harvesting of the Norwegian Web space does not exclude the library's use of other collection strategies as well. The Legal Deposit Division carries out event-based collecting. It has collected, for example, the Web sites belonging to political parties prior to, during, and after elections. This type of capture activity will continue to supplement future routine harvesting rounds. A selection of Web documents is currently harvested semi-manually using the HTTrack software, and these are cataloged for the National Library's catalog (BIBSYS). This activity will continue until the Paradigma Project's general harvesting activity and related procedures are fully established.

In many cases other methods must be used to collect digital documents. The Legal Deposit Division has already contacted Norwegian publishers about the deposit of e-books, and the library's Sound and Image Archive is working with the Norwegian Broadcasting Corporation on solutions for the deposit of "born-digital" radio and television programs. However, a large amount of administrative, legal, and technical work remains, and the deposit of dynamic publications (e.g., Web newspapers and electronic materials of all types that are stored in databases) is especially challenging.
The Paradigma Project will address these problems as the project continues.

Bibliographic Description

Today the National Library of Norway registers different types of material in various ways. Ephemeral material is given an abbreviated cataloging treatment, while books and serials are given a full bibliographic description, both in the library's catalog and in the National Bibliography. The Paradigma Project estimates that less than 1% of the material collected from the Norwegian Web space may be subject to individual manual treatment or registration at some level.

After surveying selection criteria used in other countries and in the National Library's own divisions, the project suggested selection criteria and harvesting frequencies for new types of electronic publications that are based on content (genre). We also suggested a typology based on Shepherd and Watters's[5] work and have used three main types of digital documents: traditional, i.e., similar to printed documents (monographs, periodicals, reference works, etc.); transient, i.e., based on traditional forms but extended with new functionality (net newspapers, Internet novels, etc.); and new, i.e., previously nonexistent, such as blogs and Web portals.

Automatic Processing and Analysis

We are currently investigating the use of automatic analysis and extraction of information (metadata) from Web documents. Such analysis can be used to generate "weighted" hit lists, thus helping librarians to select documents for manual registration. The technology is not yet good enough to determine a document's type automatically, but it can help to reduce the number of documents that require human intervention. For documents that are not evaluated manually, properties of a document type that are automatically captured can be made available for structured searching in the Web Archive. The value of these properties will be limited but, in combination with other search criteria, may indeed prove useful.
Metadata and Unique Identification

The Paradigma Project is also surveying metadata standards for the description of digital documents and for the exchange of bibliographic data. The resulting recommendations may form the basis for a service to publishers and other interested parties, allowing them to generate metadata descriptions for their digital documents before legal-deposit delivery.

The library must be able to handle a huge number of small data objects automatically, and it will need to identify each component (text file, picture file, sound file) in a single Web document. We are currently surveying standards for identification, and we will suggest how to improve the library's existing identifier allocation service. One enhancement would be the ability to handle chronological versions of a Web document.

Scope of the Norwegian Internet Domain

Size

The exact size of the Norwegian Internet domain is unknown at this time. The first harvesting round, in December 2002, resulted in some 3.1 million URLs, of which approximately 53% were images (.jpg, .gif, .png). The NEDLIB-harvester[6] started with about 1,000 initial URLs, and harvesting was limited to the HTTP protocol, to the Norwegian national domain (".no"), and to URLs without a search query attached. Assuming a distribution similar to that found in Sweden and Finland, we expect to find 45% to 55% of the Norwegian Internet sites in domains outside .no. We expect future rounds to span roughly ten million URLs, especially when we include Norwegian sites in domains like .org, .net, and .com, as well as URLs with search queries.

Volume

The first harvesting round retrieved files requiring 140 GB of space in the National Library's Long-Term Preservation Repository. File sizes will probably grow in the future. The space requirement estimates for the Norwegian Web space are based on an average of 100 KB per URL. We expect the first complete harvesting round to be approximately 10 million URLs, thus filling around 1 TB.
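The scope rules of the first harvesting round and the volume estimate above can be illustrated with a short sketch. It is not the NEDLIB harvester's actual logic: the function names `in_first_round_scope` and `estimated_storage_tb` are hypothetical, and decimal units (1 TB = 10^9 KB) are an assumption, since the article does not say which convention it uses.

```python
from urllib.parse import urlsplit


def in_first_round_scope(url: str) -> bool:
    """Apply the three scope rules named in the text (hypothetical helper):
    HTTP protocol only, host in the .no domain, no search query attached."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    return (
        parts.scheme == "http"        # urlsplit lowercases the scheme
        and host.endswith(".no")      # Norwegian national domain
        and parts.query == ""         # URLs with a search query are skipped
    )


def estimated_storage_tb(url_count: int, avg_kb_per_url: float = 100.0) -> float:
    """Estimate storage in TB from a URL count and an average size per URL.
    Assumes decimal units: 1 TB = 10**9 KB."""
    return url_count * avg_kb_per_url / 1e9


if __name__ == "__main__":
    print(in_first_round_scope("http://www.nb.no/index.html"))
    print(in_first_round_scope("http://www.nb.no/search?q=paradigma"))
    # 10 million URLs at 100 KB each, the figures given in the text
    print(estimated_storage_tb(10_000_000))
```

Under the same decimal-unit assumption, the first round's 140 GB spread over 3.1 million URLs works out to roughly 45 KB per URL, so the 100 KB planning figure leaves room for the file-size growth the text anticipates.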
1 TB represents roughly 1% of the total capacity of the Long-Term Preservation Repository. We expect that less than 10% of the storage capacity will be used by the Web Archive, even if both the number of objects and their average size grow drastically in the future.

Issues of Access Strategy

Providing access for users to the deposited collection of digital documents is a complex matter that is regulated by legislation and relies on technical mechanisms.

Legal Deposit Act

Section 1 of the Legal Deposit Act restricts access to source material for purposes of "research and documentation." These terms are not defined in the act itself, so the underlying intent of the act must be studied in a bill from 1988-89.[7] Loosely translated, that document says:
Using this document as a guide, the National Library interprets research to mean investigation or inquiry at a certain scholarly or scientific level and documentation to be investigation or study without the same status as research in the traditional meaning of the word, but based on a systematic use of source material. The general public has never been defined as a user of the traditional legal-deposit materials, but because public libraries generally do not maintain collections of previously published digital documents in the same way they maintain collections of traditional material, the Paradigma Project has recommended that a larger user group be given access to the deposited digital collection in the future.
The conflict is understandable, considering that many digital documents are associated with commercial interests. A single electronic item on the loose can quickly be distributed all over the globe, possibly resulting in economic loss for the copyright owner. Digital documents can easily be misused (copied, manipulated, etc.). For that reason the National Library can give access to the Web Archive only to users defined in the Legal Deposit Act and then only from a PC designated for such use on the library's premises.
Norway is bound by several international copyright conventions. The recently passed Common Market Directive 2001/29/EF (22 May 2001) on the harmonization of copyright law is scheduled to be implemented legally in Norway this year. We are watching this process closely, as it can influence the way in which the National Library allows access to its Web Archive.

The purpose of the Personal Data Act[9] is to protect persons from violations of their right to privacy through the processing of personal data. The National Library must process the digital documents that have been collected from the Norwegian Web space. Because many of these documents may contain personal data, the library received permission from the Data Inspectorate before initiating the first harvesting round. We are now authorized to collect and store Web material in 2003, but before giving access to the collection, we must secure permanent permission to do so.

Nordic Web Archive (Access Module)

For user access to the Web Archive, the Paradigma Project selected the Access Module developed by the Nordic Web Archive Project (NWA).[10] The five Nordic national libraries have now embarked on the next project, NWA-II, in which this software will be further developed. We plan to adapt the NWA Access Module to accommodate several special-user functions, including tailored interfaces for catalogers, program operators, and library patrons. This user interface will show a timeline enabling users to select different versions of the same document as captured on specific dates. The NWA Access Module may play an important part in the collaboration between the Internet Archive and several national libraries in their combined efforts to develop software in the projected National Library Web Archive Consortium.

Our Digital Cultural Heritage

The Paradigma Project's work will be finished in two years.
We hope that by then the National Library of Norway will have the technology, methods, and organization necessary to enforce the Legal Deposit Act, including for the many documents that are born digital.

Footnotes
Publishing Information

RLG DigiNews (ISSN 1093-5371) is a Web-based newsletter conceived by the RLG preservation community and developed to serve a broad readership around the world. It is produced by staff in the Department of Research, Cornell University Library, in consultation with RLG and is published six times a year at www.rlg.org. Materials in RLG DigiNews are subject to copyright and other proprietary rights. Permission is hereby given to use material found here for research purposes or private study. When citing RLG DigiNews, include the article title and author referenced plus "RLG DigiNews,

Please send comments and questions about this or other issues to the RLG DigiNews editors.

Co-Editors: Anne R. Kenney and Nancy Y. McGovern; Associate Editor: Robin Dale (RLG); Technical Researcher: Richard Entlich; Contributor: Erica Olsen; Copy Editor: Martha Crowe; Production Coordinator: Carla DeMello; Assistant: Valerie Jacoski.

All links in this issue were confirmed accurate as of April 15, 2003.