Researching
Long Term Digital Preservation Approaches in the Dutch Digital Preservation
Testbed (Testbed Digitale Bewaring)
Maureen Potter
Digital Preservation Testbed, Netherlands
Maureen.Potter@ictu.nl
In 1996, the Netherlands Ministry of the Interior and the Ministry
of Education, Culture and Sciences initiated a collaborative programme
entitled Digital Longevity (Digitale Duurzaamheid). This programme,
run in conjunction with the National Archives, sponsored Jeff Rothenberg's
1999 publication, Carrying Authentic, Understandable and Usable
Records Through Time, which proposed establishing a testbed to
carry out research into possible approaches for the long term digital
preservation of archival records (1). The
Digital Preservation Testbed (Testbed Digitale Bewaring) was born
the following year.
This article introduces the work of the Digital Preservation Testbed.
It first places the Testbed in context within the rest of the Digital
Longevity programme and defines the scope and goals of the project.
Our objectives and research questions are identified, followed by
a review of the rigorous scientific approach that the Testbed takes
in its experiments. The benefits of this are highlighted, as is the
practical nature of the Testbed. Finally, the products and deliverables
that are expected to emerge throughout the course of the project are
discussed and identified.
Background and Scope
The Digital Preservation Testbed is part of a wider
network of initiatives that the Dutch government has established to
deal with the challenges posed by the electronic era. The Testbed
belongs to the Digitale Duurzaamheid Programme, whose overall aim
is to guarantee the accessibility of information held by the government
in digital form (2). Three other projects
complete the Digitale Duurzaamheid programme: the RecordKeeping System
(RKS) project, establishing guidelines and providing advice to Dutch
Ministries on the selection of an RKS; the Kwaliteitzorg, concerned
with ensuring the quality of the records being produced electronically;
and the Taskforce Digitale Duurzaamheid, whose main aim is to raise
awareness of the digital longevity issues throughout government. The
goal of the Testbed within the Digitale Duurzaamheid programme is
to help achieve the lasting accessibility of government information
in digital form. The Testbed will provide advice that is tailored
to the situation here in the Netherlands. Our focus is on the preservation
of electronic records for the long term, and our strategy begins with
preparing for the preservation of records from their point of creation.
Our intention is to ensure the reliable creation and management of
electronic records so that they are in a suitable state for long-term
preservation action. The Testbed is running controlled experiments
to explore options for long-term preservation approaches, and advice
on these will be issued to the Dutch government later this year.
Our research is initially limited to four main alphanumeric record
types: text documents, email messages, spreadsheets and databases,
all of which are widely used within ministries and government organisations.
Three preservation approaches are under consideration: migration,
emulation, and XML, which are discussed in more detail below. Four
record types and three approaches result in 12 possible combinations.
This initial set concentrates our resources and bounds what would otherwise
be an open-ended and unstructured research area. Also, not every record
type is suitable for every preservation approach. For example, we
do not consider it to be worthwhile to attempt emulation for emails.
Email packages rely upon standard exchange formats that enable email
systems to be interoperable. The sender and receiver will often perceive
the look and feel attributes of a message differently. The question
then becomes: "what exactly am I trying to preserve?" You
need not preserve something that was not present in the first instance.
Emulation is thus not the best match for a preservation approach to
this record type.
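The scoping arithmetic above can be sketched in a few lines of Python. This is only an illustration, not part of the Testbed's tooling: the names RECORD_TYPES, APPROACHES, EXCLUDED and candidate_combinations are hypothetical, and the only exclusion listed is the one named in the text (emulation for emails).

```python
from itertools import product

# The four record types and three approaches named in the article.
RECORD_TYPES = ["text document", "email message", "spreadsheet", "database"]
APPROACHES = ["migration", "emulation", "XML"]

# Assumption: only the pairing the article explicitly rules out is excluded.
EXCLUDED = {("email message", "emulation")}


def candidate_combinations():
    """Return every (record type, approach) pair that is not excluded."""
    return [pair for pair in product(RECORD_TYPES, APPROACHES)
            if pair not in EXCLUDED]


all_pairs = list(product(RECORD_TYPES, APPROACHES))
print(len(all_pairs))                 # 12 possible combinations
print(len(candidate_combinations()))  # 11 once email emulation is dropped
```

Keeping the exclusions in a separate set makes it easy to record further unsuitable pairings as experiments identify them.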
An integral assumption in our research is that different
record types have different preservation and authenticity requirements.
Records ingested into the Testbed are analysed in terms of the five
attributes posed by Rothenberg for digital records: Context, Content,
Structure, Appearance, and Behaviour. Authenticity Requirements are
developed for each record type in terms of these five attributes and
act as the success criteria for an experiment's preservation approach.
In addition to this, other preservation and archival issues arise from
these experiments. Our research thus extends to wider considerations
and includes objectives beyond the success of a technical strategy.
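One way to picture authenticity requirements acting as success criteria is the minimal Python sketch below. It assumes, purely for illustration, that each requirement can be written as a named feature under one of the five attributes; the class AuthenticityRequirements, its evaluate method, and the example features are hypothetical and are not taken from the Testbed's own documents.

```python
from dataclasses import dataclass, field

# The five attributes posed by Rothenberg, as listed in the article.
ATTRIBUTES = ("context", "content", "structure", "appearance", "behaviour")


@dataclass
class AuthenticityRequirements:
    """Hypothetical container: features that must survive, per attribute."""
    record_type: str
    required: dict = field(default_factory=dict)  # attribute -> list of features

    def evaluate(self, preserved):
        """Return attribute -> required features missing after preservation."""
        failures = {}
        for attribute in ATTRIBUTES:
            wanted = self.required.get(attribute, [])
            kept = preserved.get(attribute, set())
            missing = [feature for feature in wanted if feature not in kept]
            if missing:
                failures[attribute] = missing
        return failures


# Illustrative features only; real requirements would be far more detailed.
requirements = AuthenticityRequirements(
    record_type="text document",
    required={
        "content": ["body text", "date field"],
        "appearance": ["font formatting"],
    },
)
observed = {
    "content": {"body text"},
    "appearance": {"font formatting"},
}
failures = requirements.evaluate(observed)
print(failures)  # {'content': ['date field']}
```

An empty result would mean the preservation action met every stated requirement for that record.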
Objectives
The objectives of the project are to provide insights
into:
- Technical solutions for the preservation of authentic
electronic records
- The effectiveness of current and potential preservation
approaches
- Authenticity features of digital records
- Cost factors for storing, preserving and managing
digital records and associated metadata
- Management processes and activities required
to capture, generate and maintain metadata that support the ingestion
of records and preservation of long-term access to authentic electronic
records
Research Questions
The Testbed Research Framework translates these
objectives into a clear set of research questions that are refined
and updated throughout the duration of the project. These range from
fundamental research questions that require comparing the results
of large groups of experiments, to questions focusing on the role
and significance of record features, attributes, and metadata that
may be answered by individual and smaller groups of experiments.
Fundamental Research Questions include:
- What are the advantages and disadvantages of
implementing each of the specified preservation approaches?
- What are the factors that affect the effectiveness
or appropriateness of a particular preservation approach?
- What are the basic requirements for preservation
functions?
These questions can be considered in light of cost, record type, authenticity
requirements, and supporting resources, to name but a few.
The subset of Attribute research questions includes:
- What are the options for preserving record attributes?
- What factors affect the preservation of record
attributes?
These questions consider attributes in terms of record type, software,
preservation approach, metadata implementation, and preservation function
implementation. Defining essential preservation metadata is also a
priority of the project, and is included in other research questions.
Preservation Approaches
As the Testbed Research Framework notes,
record keepers have not yet defined an explicit and definitive methodology
for any of the preservation approaches we are considering. The Testbed
is contributing to the delineation of various appropriate methodologies.
As discussed elsewhere, there are various ways to implement a migration
strategy (3). The same can be said for emulation
and the vaguely defined "XML approach." Indeed, there are
so many different ways to formulate a strategy that the boundaries
between them can become indistinct (4).
Let's begin by defining the various approaches: migration, emulation,
and the XML approach (5).
The Testbed definition of migration is a relatively simple one: the
transfer of records from one hardware and software configuration to
another. This includes migration through and over generations of application
versions, as well as across applications and operating systems. In
practice this often involves proprietary or partly proprietary software.
The definition excludes media refreshing.
Closely related to this is the XML approach. Conversion to XML (or
conversion to standards in general) can be considered to be a form
of migration. However, XML is not tied to any particular software
system and is often regarded as the most promising present day format
for archiving and interoperability. It has a multiplicity of uses,
and so deserves to be considered as an approach in its own right.
For the emulation strand of our experiments
we will be working with IBM and will perform experiments with their
UVC (Universal Virtual Computer) on archival records at the Testbed.
The UVC approach is described in an earlier RLG
DigiNews article. The UVC is a variation of the emulation approach
that addresses the problem of future interpretation of data files
by writing a program to carry out the interpretation in the language
of a Universal Virtual Computer. This strategy can be extended to
archiving the program as well, making it more like the full emulation
approach identified and proposed by Rothenberg (6).
Methodology
The Testbed is a practical project that runs controlled
experiments in a secure environment to study potential preservation
approaches, and to ascertain the effects of preservation actions on
different file formats and record types. In order to provide valid
and reliable results, the Testbed has established a rigorous experiment
process that clearly articulates the requirements for each experiment.
An experiment process is defined around a specific preservation approach
and record type, e.g., migration of text documents, or conversion
of emails to XML. Each process consists of 12 documented stages. Once
a process has been defined and run for the first time, further experiments
can be run on that process using its generic requirements and procedures.
Figure 1. Flow chart of Testbed Project Experiment
Process
For example: Experiment process 1 caters to the migration of text
documents. The first five stages in the process are general stages
concerned with the broad requirements of the preservation approach
(migration) and record type (text documents). These stages define
the exploration area, identify relevant background literature, specify
authenticity requirements and evaluation criteria, develop an overall
experiment process design, and estimate the required resource specifications.
The remaining stages are then specific to an individual experiment;
that is to say, they specify the records to be used in that experiment
(e.g., Microsoft Word 95) and the specifics of the approach (e.g.,
migration to PDF). Further experiments can be run using the generic
documents from the first five stages of the Process. Several further
experiments can therefore be run using one experiment process. These
may examine different formats of the record type (e.g., WordPerfect
or Microsoft Word 2000 as text documents) or they may examine metadata
issues.
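The reuse of the five generic stages across experiments can be sketched as follows. This is a rough illustration under stated assumptions: the class ExperimentProcess, its methods, and the dictionary layout are hypothetical, the seven experiment-specific stages are left unnamed because the article does not list them, and the WordPerfect-to-PDF experiment is an invented example.

```python
from dataclasses import dataclass, field

# The five generic stages described in the article.
GENERIC_STAGES = (
    "define the exploration area",
    "identify relevant background literature",
    "specify authenticity requirements and evaluation criteria",
    "develop the overall experiment process design",
    "estimate the required resources",
)
TOTAL_STAGES = 12  # stated in the article; the remaining stages are not named


@dataclass
class ExperimentProcess:
    """Hypothetical model of one process, e.g. migration of text documents."""
    approach: str
    record_type: str
    generic_docs: dict = field(default_factory=dict)
    experiments: list = field(default_factory=list)

    def define(self):
        """Produce the generic documents once, on the first run of the process."""
        for stage in GENERIC_STAGES:
            self.generic_docs.setdefault(
                stage, f"{stage} ({self.approach} of {self.record_type})")

    def run_experiment(self, source_format, target_format):
        """Start a specific experiment that reuses the generic documents."""
        if len(self.generic_docs) != len(GENERIC_STAGES):
            raise RuntimeError("define the process before running experiments")
        experiment = {
            "source": source_format,
            "target": target_format,
            "generic_docs": self.generic_docs,  # shared, not duplicated
        }
        self.experiments.append(experiment)
        return experiment


process = ExperimentProcess("migration", "text documents")
process.define()
first = process.run_experiment("Microsoft Word 95", "PDF")
second = process.run_experiment("WordPerfect", "PDF")  # invented example
print(first["generic_docs"] is second["generic_docs"])  # True
```

Sharing one set of generic documents is what removes duplication between related experiments in this sketch.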
There are several advantages to this approach. The controlled, well-documented
experiment process allows each experiment to be easily reconstructed
in the future and allows experiments to be re-run to confirm or check
results, if necessary. The process lays out requirements to be considered
at each stage, from preservation requirements to functional requirements,
metadata requirements to authenticity requirements. The iterative
nature of the process allows us to produce a base set of documents
for each set of related experiments, thus eliminating any duplication.
Grouping experiments in this way also helps us to consider the combined
results of sets of experiments and the overall "success"
of each preservation approach. The secure and controlled Testbed environment,
in which all of the experiments are carried out, ensures that our
experiment results are valid and free from errors from extraneous
sources. We carry out regular control and null experiments to ensure
this remains consistent.
The practical nature of the project brings
other benefits as well. The experiment process covers all of the key
points in the history of the record, from creation (by way of model
records (7)) to capture, appraisal, and
long-term storage. These aspects can have unexpected effects on the
implementation and success of a preservation strategy. The Testbed
considers all of these aspects within the scope of the experiment
process, and they have yielded useful and interesting results.
Preliminary Results (8)
Our first experiments concentrated on the migration
of text documents. This combination was chosen for the first experiments
as the use of text documents is widespread throughout government and
many organisations already carry out routine migrations when updating
their computer systems. Our main goal in the early experiments was
to examine and identify record features that changed as a result of
the migration process.
Microsoft Word was identified as a good starting point. It is one
of the most widely used word processing applications available and
is used by many agencies to produce government records. Experiments
have taken place on the migration of text documents through and over
generations of Microsoft Word using both model and test records. Migration
through generations refers to migrating through successive versions
of an application, e.g., from Word 95 to Word 97 then to 2000 and
then to 2002. Migration over generations skips the intermediate versions
and goes straight to the current highest version, e.g., from Word
95 to Word 2002. We have experimented with migration through and over
generations of Word 95, Word 97, Word 2000 and Word 2002. We have
also experimented with migration of Word files to PDF 1.2, 1.3, and
1.4.
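The difference between the two migration paths can be made concrete with a short Python sketch. The function names through_generations and over_generations are hypothetical, and the code only lists the conversion steps each path implies; it does not perform any file conversion.

```python
# Word versions used in the experiments, oldest first.
VERSIONS = ["Word 95", "Word 97", "Word 2000", "Word 2002"]


def _check(versions, source, target):
    """Validate that the target is a later version than the source."""
    start, end = versions.index(source), versions.index(target)
    if start >= end:
        raise ValueError("target must be a later version than source")
    return start, end


def through_generations(versions, source, target):
    """List one conversion step for each successive version."""
    start, end = _check(versions, source, target)
    chain = versions[start:end + 1]
    return list(zip(chain, chain[1:]))


def over_generations(versions, source, target):
    """List a single conversion step that skips intermediate versions."""
    _check(versions, source, target)
    return [(source, target)]


print(through_generations(VERSIONS, "Word 95", "Word 2002"))
# [('Word 95', 'Word 97'), ('Word 97', 'Word 2000'), ('Word 2000', 'Word 2002')]
print(over_generations(VERSIONS, "Word 95", "Word 2002"))
# [('Word 95', 'Word 2002')]
```

Migrating through generations takes three steps for this range, while migrating over generations takes one, which is why the relative reliability of the two paths bears on the recurring cost of migration.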
Results of the model record Word experiments
showed that if the record had been created well initially, it stood
a far better chance of retaining its features through and after migrations.
Fields that update automatically (e.g., date fields) and that were
not fixed after document creation wound up being updated every time
the document was accessed, thus altering an essential content and
context-reference feature. This is a problem whilst the records are
still in active use, let alone when they reach the archives. However,
most features migrated successfully. The position of the text on the
page was sometimes different, but colour, paragraph and font formatting,
bullets and numbering, inserted and well-formed tables, hyperlinks,
pictures and diagrams were all successfully retained in the experiments
we have carried out so far (9).
Use of records donated by government ministries took our investigations
to the next level. We had not been involved in the record creation,
so we could not be sure of how the features had been formed. One record,
which at face value looked like a well-formed table, turned out to
be composed of floating text boxes. Other records from different participants
had designated "protected sections" in which automated fields
had been used and then "fixed" in place. These sections
included such essential metadata items as date, author, recipients,
and a unique reference. Yet other records were composed on different
computers with different settings, or included text that had been
cut and pasted from a different application altogether. These "cut
and paste" sections are affected differently than the rest of
the document during and after a migration and can result in a change
in the appearance of the record without adversely affecting its content.
These experiences allow us to examine unexpected record
features and to assess more closely the ways in which records are
created. As a result, we can formulate advice on the creation of records
and use of record-creating applications, putting preservation concerns
in place from the beginning of the records continuum.
The set of Microsoft Word experiments showed
that generally, migration over generations was at least as
reliable, and in some cases more reliable, than migration
through generations. This may counter some of the scepticism
about the costs of migration. Sceptics have argued that the recurring
costs of migration will be too great to bear. Our results so far have
shown that migration to each new version of an application should
not be necessary, and we hope that experiments with other word processing
applications will allow us to extend this hypothesis. The archival
regulations of the Netherlands state that ministries are responsible
for the authentic retention of their own records for the first 20
years, after which time a percentage of them are sent to the National
Archives for long term archiving. The rest can be disposed of according
to disposition regulations. It may be the case that Ministries can
simply retain the documents in their original format, with maybe one
or two controlled migrations, until the twenty years have passed.
The Archives can then undertake more suitable long-term action concerning
the current and future formats of the record.
This is simply one possibility that we are considering. We still have
many more experiments planned, and are waiting until all of the results
are in before we release full advice on the long-term preservation
of records produced by government agencies. The combined results will
allow us to weigh alternative approaches, in conjunction with metadata
and authenticity requirements, and determine the best way to implement
these approaches. There are many different ways to carry out the same
task and it is unlikely that a "one size fits all" approach
will be suitable for different record types with different retention
requirements.
Products, Tools, and Deliverables
In addition to the research results, the Digital
Preservation Testbed will also develop a more concrete set of tools
and products. These include the Testbed Research Database, which supports
our experiments and acts as a valuable source of knowledge on many
aspects of digital preservation. This database is being built over
the course of the project and aims to collect and provide commentary
on relevant digital preservation literature. The contents are not
limited to publications, but also include public listserv messages,
Web sites, presentations, and Testbed references. Wherever possible,
we have gathered electronic copies of the research documents and stored
them in the Testbed system for reference by the project team. These
are supplemented by online resources (for which the URLs are checked
on a regular basis) and printouts. The research database is easily
searchable and will be a valuable record of the project, as well as
a useful resource. An abridged version for online resources only is
available on our Web site.
Other deliverables include white papers on each of our preservation
strategies. The white
paper on migration was published in December 2001, and the XML
for preservation paper is scheduled for publication in late summer
2002. These white papers aim to provide a synopsis of current knowledge
about each preservation approach, and to delineate ways in which the
approach can be implemented. We will also deliver a technical report
on the Testbed system itself, for which an extensive set of functional
requirements is being developed. The Testbed Newsletter is
produced quarterly and the Web site is updated monthly.
Thus far, the Testbed has completed groups of experiments on emails
and text documents and is issuing preliminary advice on the short
and long-term preservation of emails. Work will soon commence on spreadsheets
and databases, for which advice will also be released. The Testbed
project is due to finish in October 2003 but preservation research
is unlikely to stop there. The Digital Longevity Programme will continue
to run and coordinate digital recordkeeping and archival efforts for
the Government of the Netherlands.
Footnotes
(1) Jeff Rothenberg & Tora Bikson: Digital
Preservation. Carrying Authentic, Understandable and Usable Records
Through Time. (The Hague, 1999)
(2) The Testbed and the Taskforce are also
part of the ICTU, a non-profit organisation established by the Dutch
government to house their e-government projects, including PKI (Public
Key Infrastructure) and advies overheid (monitoring and advising
on government Web sites at every level). See http://www.ictu.nl
for further details. The close proximity of these projects allows
them to easily collaborate and share information.
(3) Testbed Digitale Bewaring white paper,
Migration:
Context and Current Status (The Hague, 2001).
(4) See for example Kees van der Meer et
al, Emulation
and Conversion: Organisational and Architectural Overview of an Electronic
Archive (Technical University of Delft and Utrecht University,
2001). This can also be seen by comparing current literature from
around the globe dealing with preservation strategies and enforcement.
(5) See the Testbed White Paper on Migration
(op cit) for a more extensive discussion on each of these approaches.
(6) Rothenberg & Bikson, op cit. See
also Jeff Rothenberg: An
Experiment in Using Emulation to Preserve Digital Publications
NEDLIB Report Series (The Hague: NEDLIB consortium, 2000) and Avoiding
Technological Quicksand: Finding A Viable Technical Foundation
for Digital Preservation: A Report to the Council on Library and Information
Resources. (Washington, D.C.: Council on Library and Information
Resources, 1999).
(7) The Testbed uses two sorts of records
in its experiments. The first are model records, which are created
in the Testbed to examine and evaluate the effects of preservation
action on specific record features (e.g., user-defined and automatic
fields, templates, font and paragraph formatting, and signatures).
The advantage to using our own records for this purpose is that we
know exactly which features are present in each record and where.
This allows us to carry out highly focussed experiments on record
attributes. It also serves as a good starting point for any new round
of experiments. The second sort of records are test records.
These are obtained from ministries and other government organisations,
and are used in the larger-scale experiments that address the fundamental
research questions and authenticity requirements.
(8) This section is intended to give the
reader a flavour of the types of results we have gathered so far.
Future reports from the Testbed will discuss our results to a greater
extent than this introductory article.
(9) The exact position of the text on the
page can be affected by as small a thing as changing the printer or
printer driver. This is not a change that has affected the authenticity
of the records in any of our experiments to date, but it has resulted
in documents containing several more pages than they did originally,
especially if page breaks have been employed.

Publishing Information
RLG DigiNews
(ISSN 1093-5371) is a newsletter conceived by the members of the Research
Libraries Group's PRESERV community. Funded in part by the Council on
Library and Information Resources (CLIR) 1998-2000, it is available internationally
via the RLG PRESERV
Web site. It will be published six times in 2002. Materials contained
in RLG DigiNews are subject to copyright and other proprietary
rights. Permission is hereby given for the material in RLG DigiNews
to be used for research purposes or private study. RLG asks that you observe
the following conditions: Please cite the individual author and RLG
DigiNews (please cite URL of the article) when using the material;
please contact Jennifer Hartzell,
RLG Corporate Communications, when citing RLG DigiNews.
Any use other than for research or private study of these materials requires
prior written authorization from RLG, Inc. and/or the author of the article.
RLG DigiNews is produced for the Research Libraries Group,
Inc. (RLG) by the staff of the Department of Preservation and Conservation,
Cornell University Library. Co-Editors, Anne R. Kenney and Nancy Y. McGovern;
Production Editor, Barbara Berger Eden; Associate Editor, Robin Dale (RLG);
Technical Researchers, Richard Entlich and Peter Botticelli; Technical
Coordinator, Carla DeMello; Technical Assistant, Kimberly Gazzo.
All links in this issue were confirmed accurate as of June
10, 2002.
Please send
your comments and questions to preservation@cornell.edu.
