RLG
 Feature Article 1  

Web Archive Activities in Denmark

Author: Birte Christensen-Dalsgaard - State and University Library Denmark (bcd@statsbiblioteket.dk)

Introduction
In 1997, Denmark extended its legal deposit law to include static digital documents. The implementation was, and is still, based on voluntary registration at www.pligtaflevering.dk, after which the document is checked to see if it should be deposited according to the law. If so, it is harvested into a specially developed system. Despite a good deal of publicity, initial uptake was slow. A major information campaign was repeated with greater success in 2003 and resulted in almost 7,000 publications being registered, of which a third were monographs and the rest periodical literature.

At the time the law was formulated, the notion of depositing static documents seemed an acceptable compromise between the various stakeholders including authors, publishers, the government, and researchers. It soon became clear, however, that the law was too restrictive compared to the actual development of the Internet. The fact that online newspapers were not subject to legal deposit was just one example.

The inadequacy of the law became increasingly apparent and in 2001 a number of initiatives were undertaken to increase public awareness. These were supported by the Danish Electronic Library and the Danish Ministry of Culture and included:

All of these activities and their conclusions are influencing the work being done right now to update the legal deposit law to cover Internet materials, as well as national radio and television broadcasts.

The national libraries in Denmark worked to involve media researchers in the PPF conference and various projects right from the beginning. The research community was represented by the Centre for Internet Research at the University of Aarhus. Its participation was seen as key to improving the chance that the collection policy would satisfy researchers. From a publicity point of view, this turned out to be very important because most newspapers and news channels wanted the media researchers' point of view, since they best represented the perspective of future users.

To get input into the technical, legal, and organisational issues, two technical trials were conducted: one concerning the municipal election in 2001 and one testing various harvesting methods.

This paper is based on that work and describes our experience and some of our results. As the acknowledgement at the end makes clear, it presents the work of the whole project team.

Selection—What to Include?
The principle behind the legal deposit law is to ensure that all published materials of relevance for Denmark are collected and preserved for future use. The intent has been to collect comprehensively rather than selectively, on the assumption that it is difficult to predict what will prove most valuable in the future. For example, the collection of house-distributed warehouse catalogues seemed strange in the 1930s, but these are now very much in demand as they give a picture of society at that time.

The Internet today is not only a method for disseminating information; it has become the place for a whole range of activities from chatting to shopping, from getting advice to filling in tax returns. At the PPF conference, the activities of three generations of a family were evaluated [4] in terms of what would be collected through an extrapolation of the existing legal deposit law. The conclusion was that the current law was inadequate and that we presently lose relevant information. This was made very clear during our first technical trial of harvesting 80 Web sites. The rationale for choosing 80 sites for selective harvesting was based on a combination of arguments such as coverage of rapidly changing sites, general geographical and demographical coverage, and the complexity of the site.

Researchers identified relevant object types for the event-based collection concerning the municipal election. The list included SMS (short message service), shoutboxes, chatrooms, quickpolls, games, etc., on top of "normal" Web pages. Researchers concluded that all of these forms of expression and documentation were essential; for example, how parties tried to attract the interest of young people was best revealed in these "alternative" materials.

Strategy
Based on the experience of other countries and on the analysis done as part of our trials, we proposed a hybrid strategy with three components: bulk harvesting four times a year, selective harvesting of approximately 80 sites, and two or three event-based harvests per year.

The argument for the hybrid approach is illustrated in Figure 1 below. The vertical axis, Changes, indicates the change frequency of information on various Web sites. The lower limit "static" indicates that the materials are put on the Web and never changed; the upper limit "live" indicates a great rate of change. For materials undergoing rare updates, occasional visits by the harvester will catch the whole site. Rapidly changing sites are more difficult to preserve. A good example is a newspaper Web site in which sports scores are given as they become available [1]. Looking at relevant approaches we found that the bulk harvest strategy, focusing on breadth, is suitable in the rare to infrequent change range, whereas the selective harvesting, focusing on depth and harvester configuration, is best in the frequent change range.

The horizontal scale on Figure 1 shows the level of interactivity, which may relate to the complexity involved in harvesting the materials. Some of the materials with high interactivity, like database systems, can be impossible to harvest automatically and hence require manual configuration of the harvester. One may generalise and say that the lower part is the surface Web and the upper part is the deep Web. This is often true, but not always. For example, an interactive story (medium to high interactivity) may be developed using Flash animation, which is quite easy to harvest (but difficult to logically preserve).

Figure 1: Interactivity and change frequency of various types of Web sites

Difficult Sites: Complex Application and Complex Format
For a number of reasons a Web site may be difficult to handle. Here we will deal with two:

  • the Web site is difficult to harvest due to frequent updates or because it belongs to the deep Web
  • the Web site contains objects that are difficult to preserve

A highly interactive site may be easy to harvest as illustrated by "Soldaten i baghaven" [Soldier in the backyard], an interesting and interactive site meant to tell a story for youngsters. It illustrates why the term "generally" was introduced when high interactivity was equated to the deep Web. It was developed using Macromedia Flash and is one big file with all its interactivity embedded. This site is easy to harvest, but very risky to preserve. It seems that today emulation is the only real option for such a site.

The problems of obtaining files and of preserving/presenting files are to some degree orthogonal, as illustrated in Figure 2, and require two different types of solutions. A set of files may be straightforward to obtain yet very difficult to preserve in the long term, or hard-to-obtain files may turn out to be plain text that can easily be preserved and viewed in the future.

Figure 2: The problems of archiving and viewing files are mostly orthogonal to the problems of complex digital applications

It is clear that a national net archive should take the deep Web into consideration or risk missing the majority of what it sets out to archive. But in order to do so, we must identify the issues that make deep Web sites so difficult to harvest. In our trials, six problem areas were identified:

These six categories present problems for a crawler, but in different ways and with different implications for the approach. The project team identified a range of problems and formulated many ideas; however, none were tested. An approach to get part of the deep Web through harvesting via Z39.50 and OAI protocols may be tested in the fall of 2004.

The other axis of Figure 2, the file format problem, is an essential part of long-term preservation of digital objects. Very few digital objects can be read without some kind of interpreter, and it is uncertain which, if any, of the current interpreters will be available and functioning after 50 or 100 years. We will return to formats later.

Harvester and Archive Format
Several harvesters have been used and tested in different stages of our trials. We started using WGET, as part of Project NEDLIB (Networked European Deposit Library), and a commercial product. In the present trial we used HTTrack for selective harvesting and Heritrix for bulk harvesting. The main argument for using HTTrack was that, at the time of the trials, it performed better on selected sites. Heritrix will be used in the future for both depth and breadth harvesting.

Determining suitable archive formats is a daunting challenge. We began by identifying requisite features: the format must be OAIS compatible, must be suited for long-term storage, and must support all Internet protocols and metadata. Further it must support data integrity and it must be possible to retrieve the original bit-stream.
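The last two requirements, data integrity and retrieval of the original bit-stream, can be illustrated with a small sketch. It is not the ARC format or any other format examined in the trials; the record layout and the names `StoredObject`, `store`, and `retrieve` are assumptions made purely for illustration, as is the choice of SHA-256 as the digest. The idea is to keep the harvested bytes unmodified alongside a digest computed at ingest, and to verify that digest whenever the object is read back.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredObject:
    """Hypothetical archive record: original bytes plus an ingest digest."""
    url: str
    payload: bytes  # the bytes exactly as received from the server
    sha256: str     # hex digest computed once, at ingest time


def store(url, payload):
    """Create a record without altering the payload in any way."""
    return StoredObject(url, payload, hashlib.sha256(payload).hexdigest())


def retrieve(obj):
    """Return the original bit-stream, refusing to do so if it has changed."""
    if hashlib.sha256(obj.payload).hexdigest() != obj.sha256:
        raise ValueError(f"integrity check failed for {obj.url}")
    return obj.payload
```

Any conversion for presentation purposes would then operate on a copy, so the bytes returned by `retrieve` remain those originally harvested.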

Different archive formats for storing the data were investigated, among these the METS format of the Library of Congress and the ARC format [2] designed by the Internet Archive for its archival systems. None of the investigated formats satisfied all requirements, but the ARC format came closest. A number of actions were performed with the archive format to test usability and suitability. ARC files were written both with Heritrix (native) and HTTrack (using a new module). It was demonstrated that the data (including metadata) could be written and read and displayed again. The data harvested using NEDLIB were converted into the ARC format as part of the NWA Project (Nordic Web Archive Project). The ARC format has been extended to allow storage of converted files. A conversion tool that performs batch conversion of files stored in ARC files has been implemented.

A detailed analysis was completed to identify which metadata could be generated automatically in connection with the harvesting and which required manual intervention. Concerning bibliographical metadata, the working group under the Ministry of Culture [3] recommends:

that the material from the Internet, which is collected according to the envisioned harvesting strategies, is not registered on the level of the individual work. Instead it is recommended that a registration will take place for groups of materials, such that metadata are created for each instance of the harvesting (ingest). These metadata might contain the time for the harvest, technical aspects of the harvesting, the resulting volume and quality, or the content. The individual documents can easily be made subject of search through e.g. indexes.

Although not everyone officially accepts this strategy, it is the strategy adopted by the Web archive consortia and the strategy applied in the present trials.

The project has investigated quite thoroughly which technical and administrative metadata can be created as part of the harvesting and which can be embedded in the ARC file.

Example: Harvest of a Newspaper Site
Online news sites constitute an important source of information and it is important to include them as part of the Web archive.

Jyllands-Posten (JP) is Denmark's largest daily newspaper and also one of the leading online newspapers. JP's site is updated several times daily (on March 11 the front page was updated 260 times) and has contents ranging from breaking news and readers' opinions to classified ads and TV guides. Because of its variety, its importance for preservation, and its frequent updates, JP was considered a very suitable test case for the archive.

The approach chosen was to receive notifications from JP whenever updates took place. Jyllands-Posten agreed to provide a continuous log of change notifications for its Web site that could be accessed via HTTP. Below is an extract showing the timestamp and the URL of the changed objects.

2004-03-11-10:09:11 /forside/
2004-03-11-10:09:11 /common/forside_mid_bottom_article_list:section_name=Udland
2004-03-11-10:09:11 /common/right_article_list:section_name=Udland
2004-03-11-10:09:11 /common/right_articles_forside:section_name=Forside
2004-03-11-10:09:11 /seneste/

Figure 3: Extract of log from Jyllands-Posten

The log file was downloaded at an appropriate frequency and was used as a basis for selective harvesting of the relevant objects.
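The log-driven step can be sketched as follows. This is a minimal illustration and not the project's harvester extension: the function names `parse_change_log` and `paths_to_harvest` are hypothetical, and the only thing taken from the source is the line format shown in Figure 3 (a timestamp, a space, and the path of the changed object). The time of the previous harvest used in the example is likewise an assumption.

```python
from datetime import datetime

# Timestamp layout as it appears in the Figure 3 extract, e.g. 2004-03-11-10:09:11
TIMESTAMP_FORMAT = "%Y-%m-%d-%H:%M:%S"


def parse_change_log(text):
    """Parse log text into (timestamp, path) pairs, skipping blank lines."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        stamp, path = line.split(" ", 1)
        entries.append((datetime.strptime(stamp, TIMESTAMP_FORMAT), path))
    return entries


def paths_to_harvest(entries, last_harvest):
    """Return each path changed after last_harvest, once, in log order."""
    seen = set()
    result = []
    for stamp, path in entries:
        if stamp > last_harvest and path not in seen:
            seen.add(path)
            result.append(path)
    return result


LOG_EXTRACT = """\
2004-03-11-10:09:11 /forside/
2004-03-11-10:09:11 /common/forside_mid_bottom_article_list:section_name=Udland
2004-03-11-10:09:11 /common/right_article_list:section_name=Udland
2004-03-11-10:09:11 /common/right_articles_forside:section_name=Forside
2004-03-11-10:09:11 /seneste/
"""

if __name__ == "__main__":
    entries = parse_change_log(LOG_EXTRACT)
    # Assumed for illustration: the previous harvest ran at 10:00 the same day.
    for path in paths_to_harvest(entries, datetime(2004, 3, 11, 10, 0, 0)):
        print(path)
```

Because a busy page can appear in the log many times between two downloads, the sketch fetches each changed path only once per pass; whether every intermediate version should be captured instead is a collection policy question that the code does not settle.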

The advantages of this approach are:

  • less load on both JP's system and the archive's harvester in comparison to traditional harvesting
  • JP puts relatively little effort into making this work
  • virtually all updates of JP's site can be captured
  • although this is a form of deposit, the materials are acquired in the same manner a consumer would obtain them (via HTTP)
  • the implementation of the necessary extension to the harvester is of moderate size and can be reused with other producers

Disadvantages are:

  • we are dependent on the accuracy of the logs provided by the producer. For example, the archive needs to ensure that any bugs that cause certain materials to be missing from the log are discovered (e.g., with random quality checks)
  • system migrations on the producer side may require modifications to the log generator, potentially introducing errors and additional cost
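The random quality check mentioned in the first point could look roughly like this. It is a sketch under assumptions: the archive is assumed to detect changes independently for a small random sample of paths (for example by refetching them and comparing digests, which is not shown here), and `sample_paths` and `unlogged_changes` are hypothetical names rather than part of any tool used in the trials.

```python
import random


def sample_paths(known_paths, sample_size, seed=None):
    """Pick a random subset of known paths to check independently."""
    rng = random.Random(seed)
    paths = sorted(known_paths)
    return rng.sample(paths, min(sample_size, len(paths)))


def unlogged_changes(observed_changed, logged_paths):
    """Return paths seen to change that the producer's log never mentioned."""
    return sorted(set(observed_changed) - set(logged_paths))
```

A non-empty result from `unlogged_changes` would suggest a gap in the producer's log generator and would be a reason to contact the producer.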

An important achievement of the project was the collaboration with other newspapers in Denmark, allowing the approach to be extended to other news sites in the near future.

Preservation and Formats
The cost of establishing an organisation and an infrastructure that can perform the job of archiving the Internet is currently being calculated. The cost projections cover all known aspects, such as the establishment of an organisation involving representatives from stakeholders and establishing and maintaining a trusted repository, to mention two of the important points from a long list.

One of the less-precisely specified areas is the logical preservation part. Some preliminary work has been done as part of the second trial, where the area of file formats was investigated and documented by Lars Clausen. Some extracts from this report are discussed below.

Five aspects were identified as relevant for the discussion of what to preserve.

Readability: A minimum requirement must be that the core elements can be read.

Comprehensibility: Most text documents have more to them than just the raw text. Data may be lined up in columns, arrows may point at important features, text attributes may indicate particularly important words, etc.

Appearance: Some attributes of a file format are not necessary in order to understand the meaning of a file, but are part of the overall impression.

Functionality: Unlike analog objects, digital objects often have functionality beyond that of visual and audio characteristics.

"Look & Feel": A perfect copy of a digital object would preserve not only the appearance and functionality of the original, but the entire "look & feel," for example, the design and operational quirks of GUI elements, the resolution of the monitor, and even the speed of the machine.

It is not known what aspects will be considered important in the future. When the Danish newspaper archives were started, most people expected the news articles to be the significant part, but current researchers are no less interested in obituaries and advertisements. Similarly, a future researcher may be interested in today's layout techniques, interaction models, or other features that we haven't even considered.

We face a trade-off between how much we can preserve and the resources we can spend on preserving it. It would make little sense to allocate many resources to correct preservation of a file format that appears only a few times in a billion-object archive.

NWA software has been extended to support the ARC format, and tests have been made showing materials coming from different sources (harvested by Heritrix and by HTTrack). Also, a system has been developed based on NWA that can be used to test the harvested materials for completeness. If links are missing, they will automatically be collected by the quality check software and harvested.
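A completeness test of this kind can be sketched in a few lines. The code below is not the NWA-based system; it is an assumed, simplified model in which the archive is a dictionary from URL to HTML text, and `LinkCollector` and `missing_links` are hypothetical names. It reports every HTTP link or embedded resource referenced by an archived page that is not itself in the archive, which is the list a harvester would then be asked to fetch.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkCollector(HTMLParser):
    """Collect href and src attribute values from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def missing_links(archive):
    """archive maps URL -> HTML text; return referenced URLs not in it."""
    missing = set()
    for page_url, html in archive.items():
        collector = LinkCollector()
        collector.feed(html)
        for link in collector.links:
            # Resolve relative links and drop "#fragment" parts before comparing.
            absolute, _fragment = urldefrag(urljoin(page_url, link))
            if absolute.startswith("http") and absolute not in archive:
                missing.add(absolute)
    return sorted(missing)
```

A real check would also need to decide which missing URLs fall inside the collection scope, since a link to a foreign site is not necessarily a gap in the archive.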

Legal Aspects
As mentioned in the beginning, the trial has been used to identify issues to be addressed in a new legal deposit law. It has been used to prove the technical feasibility, to identify the relevant organisational structure, and to identify potential legal issues. Examples of issues raised are the deposit libraries' right to get access to Danish domain names and information about domain owners, the right to make copies for preservation purposes even if they change the format and functionality of the document, and rights associated with access for research purposes and by the general public. As these issues are in general closely linked to the national copyright regulation and rules concerning protection of privacy, they are not discussed here.

Acknowledgement
The work described above was done through close teamwork between people at the Royal Library and at the State and University Library. I want to thank Steen Slot Christensen, Niels Christensen, Tue Larsen, and Søren Carlsen from the Royal Library in Copenhagen and Thomas Zäschke, Bjarne Andersen, Lars Clausen, Harald van Hielmcrone, and Frank Sørensen from the State and University Library in Aarhus for their engagement and their creative and constructive attitude throughout. Special thanks are due to Birgit Henriksen for continuous discussions and inspiration during the whole process.

The work has been supported by the Danish Electronic Library (DEF) and by the Ministry of Culture.

References
[1] Niels Brügger, The last page of the internet? The importance of Preserving the Dynamic Aspects of the Internet
[2] Mike Burner and Brewster Kahle, WWW Archive File Format Specification, September 15, 1996


Copyright 2004 RLG.