RLG
 Feature Article 1  

A Comparison Between Migration and Emulation in Terms of Costs

Authors: Erik Oltmans - National Library of the Netherlands (erik.oltmans@kb.nl), Nanda Kol - IBM / Delft University of Technology, the Netherlands (nandakol@hetnet.nl)

1. Introduction

Digital publishing is causing publishers, research institutions, and libraries to develop new policies, new infrastructures and techniques, and new business models as well. A major problem is that, at the same rate at which our world is becoming digital, digital information is threatened. New types of hardware, computer applications, and file formats supersede each other, making our recorded digital information inaccessible in the long term. The Koninklijke Bibliotheek (KB) has, jointly with IBM, developed and implemented an OAIS-based deposit system: the e-Depot. Moreover, the KB signed archiving agreements with major scientific publishers for permanent storage of their digital materials. An important issue in digital archiving is long-term access: how can we guarantee permanent access to digital publications while software and hardware are constantly changing? This issue strongly relates to the object’s life cycle management, as ineffective life cycle management might compromise availability of the digital object in the long run.

In this paper, we discuss life cycle management issues as they relate to two prominent digital preservation techniques and associated costs: migration and emulation. We argue that applying the emulation strategy may be more efficient in terms of life cycle management (and thus costs) than the migration strategy. We first introduce the KB e-Depot and describe its main workflow, then discuss the two main digital preservation strategies, and finally relate these strategies to life cycle management and cost issues.

2. The KB e-Depot

In 1999, the KB specified the system requirements for a full-scale deposit system, which were based on the ISO 14721 standard for digital archives: the Open Archival Information System [1]. As a result of a European tender procedure in 2000, the KB contracted the development of the deposit system to IBM in the Netherlands. In December 2002 the system was delivered to the KB. IBM constructed the system using as many off-the-shelf components as possible, such as WebSphere, DB2, Tivoli Storage Manager, and Content Manager, and branded it under the name Digital Information Archiving System (DIAS). Using DIAS, the KB maintains the deposit service called the e-Depot. (See Oltmans & Van Wijngaarden, 2004 [2] for a complete description and Steenbakkers, 2003 [3] for more details about the history of the KB e-Depot.)

The KB has developed a workflow for archiving electronic publications and has implemented the other parts and interfaces of the infrastructure in which DIAS is embedded. This infrastructure consists of a variety of functions for:

  • validating and pre-processing electronic publications,
  • generating and resolving unique identifiers,
  • searching and retrieving publications, and 
  • identifying, authenticating, and authorizing users.

The process of loading consists of pre-processing and ingesting the digital content. Electronic publications stored in the e-Depot are either offline publications on media such as CD-ROMs (also referred to as “installables”; cf. Oltmans & Van Wijngaarden, 2004 [2] for more details) or online publications, most often electronic articles. Online journal articles are either sent to the KB on tapes or DVDs, or they are fetched by means of FTP. In both cases, publications ready for ingest end up in an electronic post office. At this stage the content of the submitted publication is validated with regard to its authenticity and well-formedness, based on previously agreed specifications. If the material does not match the checksum (or if other errors occur), the content is passed to a database for error recovery. If the content appears to be valid, content and metadata are combined to form Submission Information Packages (SIPs). These SIPs are then processed by DIAS. See Figure 1 for a complete overview of the data flow.

Figure 1: General e-Depot Data Flow
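
The validation step in the post office can be illustrated with a minimal sketch. It is not the KB's or DIAS's actual implementation: the hash algorithm (SHA-256), the function names, and the dictionary layout of the result are all assumptions made for illustration, and only the checksum check is shown, not the well-formedness checks.

```python
import hashlib


def validate_content(content: bytes, expected_checksum: str,
                     algorithm: str = "sha256") -> bool:
    """Return True if the content matches the checksum supplied with it.

    The algorithm is an assumption; the article does not say which one is used.
    """
    digest = hashlib.new(algorithm, content).hexdigest()
    return digest == expected_checksum.strip().lower()


def route_submission(content: bytes, metadata: dict,
                     expected_checksum: str) -> dict:
    """Hypothetical routing: error recovery on mismatch, otherwise build a SIP."""
    if not validate_content(content, expected_checksum):
        # In the workflow described above, this content would be passed
        # to the error-recovery database.
        return {"status": "error_recovery", "reason": "checksum mismatch"}
    # Valid content and its metadata are combined into one package (SIP).
    return {"status": "sip", "content": content, "metadata": metadata}


if __name__ == "__main__":
    article = b"%PDF-1.4 example article bytes"
    checksum = hashlib.sha256(article).hexdigest()
    print(route_submission(article, {"title": "Example"}, checksum)["status"])
    print(route_submission(article + b"x", {"title": "Example"}, checksum)["status"])
```

In this sketch a failed check never discards the content; it only changes where the content is sent, which mirrors the error-recovery path in the text.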

DIAS ingests both the content and the metadata, converting the publisher’s bibliographical descriptions into the KB internal format and adding a National Bibliographic Number (NBN). This number functions as the unique identifier of every digital item stored in the system. The content itself is stored in the e-Depot, while the metadata is stored in the KB catalogue. Technical metadata is stored and maintained by using the Preservation Manager, developed in collaboration with IBM [4]. End-users may query the online catalogue and retrieve the full text of the publications. In the case where access restrictions are imposed by the publisher, retrieval may occur only after a process of identification, authentication, and authorization (IAA). The e-Depot itself cannot be accessed directly, but passes relevant publications to the end-user after verification.

Six major publishers have signed distinct archiving agreements with the KB for long-term digital archiving of their electronic publications:

  • Elsevier Science
  • Springer
  • BioMed Central
  • Blackwell Publishers
  • Oxford University Press
  • Taylor & Francis

At this moment the digital publications of these publishers are being loaded into the e-Depot, involving more than 2,500 journals, containing over 8 million articles. For publications that are processed both in digital and printed form, the KB has decided to process only the digital manifestation of the publication. New acquisition methods are studied in order to obtain the electronic publications, such as extensions of the OAI harvesting protocol [5, 6]. With respect to version management of electronic publications, the e-Depot is able to deal with updates and retractions. Updates of electronic publications are sent to the KB with different time stamps compared to the original submissions. These authentic publications are not discarded from the system, but the original metadata is temporarily withdrawn, so that only the updated material will be found in the central catalogue. This way, the KB does preserve the complete record of science, but allows publishers to ask for the temporary withdrawal of specific articles. Once an article is in the system, it will never be discarded or deleted [3].

3. Long-term preservation: migration and emulation

New types of hardware, computer applications, and file formats are continually being developed, making digital information from the past inaccessible. Even if the hardware or the carrier-media do not deteriorate, the technology to access the information will inevitably become obsolete. Preservation or permanent availability of digital information is one of the processes dramatically affected by the evolution towards an all digital world.

In general there are two main digital preservation approaches. The first one focuses on the digital object itself and aims at changing the object in such a way that software and hardware developments will not affect its availability. By changing or updating the format of an object, it is made available on new software and hardware. The digital object will be adjusted to changes in the environment, which makes it possible to render objects by using current systems.

The second approach does not focus on the digital object, but on the environment in which the object is rendered. It aims at (re)creating an environment in which the digital item can be rendered in its authentic form. The first approach (changing the object) is known as migration or conversion. The second approach (changing the environment) is known as emulation. Both models are considered for implementation at the KB, and each will be discussed here in brief.

3.1 Migration

With migration, file formats will be converted into new formats as soon as the original formats run the risk of becoming obsolete. For example, if technology scans indicate that PDF version 1.1 will soon be inaccessible, all files in the digital archive of format PDF 1.1 will have to be converted into, for example, PDF format version 1.4. This way, the digital publications will be prepared for rendering for another period of time, until the format PDF version 1.4 runs the risk of becoming obsolete itself. At that time another migration procedure will need to be carried out.

An advantage of migration as a digital preservation strategy is that electronic publications will always be available in the form that is generally accepted, e.g., PDF, and current hardware and software will be able to render these formats with little difficulty. Older documents that are properly migrated will be available for some time in the present and the near future, and their electronic content can be used for copy and reuse. A major drawback might be that while converting documents from one form to another, some aspects of the document’s layout or, even worse, data might get lost. If preserving the original “look and feel” of the document is important, or when one is dealing with dynamic objects, then migration is probably not the best solution. Moreover, migration is necessary for every single document in the collection, and should preferably be carried out each time a serious update of the file format is available. It may be straightforward to convert from version A to version B, but converting from version A to version C or D might be a complicated matter (see Caplan, 2004 [7] for an elaboration on this issue). Moreover, with migration it may be impossible to perform a conversion if the file format and the migration tool are no longer active. When applying the migration strategy we have to constantly study conversion programs and execute them when possible, so as to prevent digital information from getting lost.

3.2 Emulation

Emulation, on the other hand, preserves the authentic document and provides the user with a tool that enables “old” software and “old” viewer programs to render this original document. An emulation tool generates an authentic view by launching the original viewer in the context of the original platform. The emulation tool makes the original viewer and the original platform work in future environments.

An advantage of emulation is that the original “look and feel” of the publication can be preserved. As with preserving books, the authentic instantiation will be there to be rendered, in contrast to migration in which possible other instances are used rather than the original. However, a serious drawback is the complexity of developing and maintaining emulation tools. In the future, we will have to maintain several emulation tools, and it cannot be proven that these will always work on future computer platforms. Maintenance of emulation tools can be reduced considerably if a virtual layer is introduced. This means that emulation tools are developed to run on a virtual machine, of which the upper side—the interface to the emulation tools—remains the same through time. One only has to adapt the bottom side of the “virtual machine” once in a while. This way, emulation tools will remain unaffected.

3.3 The need for both emulation and migration

In the next section, emulation and migration techniques will be compared in terms of life cycle management and associated costs. In general, we will see that cost arguments will be in favor of emulation techniques. However, this does not imply a strong preference for emulation over migration. There are arguments both for preserving the original “look and feel,” as well as for converting documents to new standards.

The main reason for preserving the authentic form is that the KB digital archive serves as a safe place for original materials from publishers. The KB promises to keep the original bit stream of the received document. In the future, emulation tools will be needed in order to render these publications in the same way as they were published originally. Secondly, authenticity of a publication may be of importance for end users who want to access publications and experience the original “look and feel.” For these reasons, emulation tools are needed [8].

On the other hand, there is a specific need for converting documents into the most current standard as well. For future end users who want to have access to publications according to the standards and functionalities of that time, migration might be needed in order to copy and reuse data. In short, we do not favor one strategy over another. In fact, both are studied and considered for implementation at the KB.

4. Long-term preservation and associated costs

Any particular digital preservation strategy strongly determines the life cycle management of digital publications and thus the associated costs. In order to specify the long-term costs of a digital archive, it is necessary to understand the implications of choosing a particular preservation strategy. The resulting costs may in turn determine (or limit) the choice for certain preservation strategies. Unfortunately most business models and cost-estimates available in the literature only address the general preservation issues and say little to nothing about the costs of specific strategies (although Dürr, 2001 [9] specified some numbers about the costs of migration). The following comparison is based on available information from the literature in combination with our own insights and the experiences at the KB.

Where the conversion of objects to other formats constitutes a considerable cost factor in migration, these costs can be saved when applying emulation. In turn, emulation requires more initial investments, which makes it inappropriate for short-term preservation. For a proper cost comparison, the costs of each strategy should be specified in relation to the term for preservation. Emulation costs can be classified by the tasks that should be executed in order to realize access in the future. Consider the following task overview:

  • One-time costs: Development of an emulation device
  • Recurring costs: Developing emulators for the components of the original hardware platform
  • At access time: Running the emulator and the appropriate software environment.

The costs of developing an emulation device can be derived from the costs associated with the creation of a demonstration version of the Universal Virtual Computer (UVC), which was implemented in 2004 by IBM as a result of a request of the KB. Including research and design, it took 32 weeks of 40 hours a week to accomplish this task. If we assume an hourly rate of $120, the one-time costs of developing an emulation device are approximately $150,000 (1,280 hours × $120 = $153,600). The actual implementation costs (excluding research and design) are estimated at $20,000 (for a total of 160 man-hours). The first prototype of this emulation tool based on the Universal Virtual Computer concept is now available. For more information about this project, we refer to Lorie, 2002 [10] and Van Wijngaarden & Oltmans, 2004 [11]. More information and a “proof-of-concept” demonstration of the UVC can be obtained from http://www.alphaworks.ibm.com/tech/uvc.

The development of emulation tools requires serious research and development, and requires technical expertise to implement the concepts. Once it has been implemented it should also be maintained. Moreover, one must develop an emulation tool each time objects from a new platform are accepted into the archive, which also requires investments in both research and implementation. We assume maintenance costs will be relatively low (between $2,000 and $3,000 per year). The development of an emulation tool will be more costly. However, one must realize that the costs of developing emulation devices can easily be shared by digital preservation repositories all over the world, since once an emulator is available, it can be used to access any digital object that used to run on the emulated platform. In contrast to migration, where the task must be executed for each object separately, an emulator can be used to access a whole range of digital objects. In short, emulation tools can be shared among institutions, which makes it possible to share the costs of research and development investments as well.

Unfortunately the costs of accessing a digital object using the emulation approach are not known. In order to make a proper cost estimate one should account for the time necessary to configure an emulation of the original hardware using the appropriate emulation modules. Then one should add the time necessary to install the software environment that is associated with the specific digital object. This time will be less if the process is automated.

Migration on the other hand is relatively cheap in the sense that many conversion tools are available, and executing a conversion program is a relatively straightforward task. However, by definition, migration applies to the entire collection repetitively: each and every single object in the digital archive has to be converted, again and again. This means that the bigger the archive gets, the more expensive migration will be. This is in contrast with emulation: emulation tools apply to the collection as a whole and do not affect the format of individual digital objects.

In order to specify this difference exactly, we will use the cost model formula as presented by Shenton, 2003 [12]. The costs of preserving a non-digital monograph over time are taken as an example. Shenton specifies them as follows:

(1) K(t) = s + a + c + p1 + h1 + p(t) + h(t)

Where K(t) is the total cost of holding an item for a period of t years, where s=selection, a=accessioning, c=cataloguing, p1=initial preservation, h1=initial handling, p(t)=longer-term preservation, h(t)=storage.

By applying this formula, the long-term costs of preserving a monograph can easily be calculated. Moreover, this formula makes it possible to calculate the downstream costs if, for instance, an additional $100,000 were available for acquiring monographs.

It would be of significant interest to have such a formula for electronic publications as well. Obviously, not all the variables in Shenton’s formula are applicable to digital objects. A fundamental part of every formula that applies to electronic materials would consist of:

(2) K(t,a) = s(a) + i(a) + h(t,a)

Where K(t,a) is the total cost of holding a number of objects a for a period of t years, with s=selection, i=ingest, and h=storage.

The selection process is quite obvious: it consists of acquiring the objects and preparing them for further processing. The ingest process consists of the automatic processing of the digital objects by some sort of software program. It should, among other tasks, convert the associated metadata into a usable format and store the digital objects on some sort of storage system. The storage costs themselves cover purchasing storage media, applying media refreshment, and maintaining some sort of database management system. There is a direct relation between the overall costs and both the number of items and the number of years they are preserved. More items will cost more, and storage for a longer term will cost more as well.

A part of the formula above is fundamental for two other formulae that we propose. However, these will not include the costs for selection and ingest for two reasons. First of all, the costs for selection and ingest will be the same for both emulation and migration; in other words, they will not influence the relative difference. Secondly, it is quite difficult to estimate these costs, as they depend on the archiving agreements with the publishers (selection) and the type of software that is in place (ingest). Both will differ considerably, depending on the circumstances. Therefore we focus on storage costs and the dedicated costs for both migration and emulation.

The first formula, for migration, is as follows:

(3) K(t,a) = h(t,a) + m(t,a)

Where K(t,a) is the total cost of holding a number of objects a for a period of t years, with h=storage costs and m=migration costs.

A new variable is introduced that expresses the costs of migrating an object. The costs of migrating digital objects depend on the time t (the longer we preserve the objects, the more often we have to convert them) and on the number of objects a (the more objects in the archive, the more conversion actions have to be executed).

The formula for calculating the emulation costs is as follows:

(4) K(t,a) = h(t,a) + E + e(t)

Where K(t,a) is the total cost of holding a number of objects a for a period of t years, with h=storage costs, E=costs of setting up the emulation virtual machine, and e(t)=costs of emulation over time.

Two new variables are introduced: the one-time costs for developing an emulation device and tools are expressed by E, while the yearly maintenance costs of the emulator are expressed by e(t). Maintenance costs and costs for the development of emulation tools are independent of the number of objects: the emulation device and other emulation tools apply to the entire collection, and no special action is needed when rendering an object in the digital archive. However, the emulation tools need to be maintained over time, which makes the maintenance costs dependent on the number of years. (In this cost comparison it is assumed that migration is not suitable for the preservation of dynamic digital objects. Therefore this cost comparison focuses only on the costs of preserving static digital objects.)

Having the first, primitive, formulae in place, we can now associate specific variables with the cost components. The costs for storage and migration are based on figures from the literature (for instance, see Fox, 2002 [13] for an elaboration on storage costs), but may vary. A complicating matter in this respect is that both figures usually express costs per Megabyte or Gigabyte. Since insufficient information about the costs of migration and emulation was available in the literature, a number of assumptions have been made:

  • The number of objects (a) in the repository increases every year.
  • Storage costs are estimated at $0.05 per object per year.
  • The required storage space for both strategies is about equal (the additional storage space required for the preservation of the software environment in case of emulation is considered negligible).
  • The average costs of migrating an object into a newer format (m) are $0.10 per object per year.
  • The one-time costs for creating an emulation device, including research and design, (E) are $200,000. (It is assumed that this will be more complex than the UVC mentioned before.) 
  • The average costs for maintenance of the emulation device and development of emulation tools for new sorts of objects (e) are $30,000 per year.
  • The costs of running the emulator and the appropriate software environment at access time are not included.

Considering these figures, the costs of preserving 1,000,000 objects for a period of 50 years can now be calculated, both when applying migration and emulation as the leading digital preservation technique. The graph below demonstrates these costs:

Figure 2: Costs for migration (blue line) and emulation (red dotted line) for maintaining an archive of 1,000,000 digital objects over a period of 50 years. The initial investment in setting up an emulation tool yields high costs in the first five years. But soon after that, the migration costs are higher than the emulation costs and the difference increases every year. In 50 years, the migration costs are 79% higher than the emulation costs.
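
The totals behind Figure 2 can be approximated with a short sketch of formulas (3) and (4). It is an illustration, not the authors' spreadsheet: the function and class names are hypothetical, all cost components are assumed to grow linearly with time, and the collection size is held constant (an assumption that happens to reproduce the 79% difference stated in the caption).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostAssumptions:
    """Values taken from the list of assumptions above (US dollars)."""
    storage_per_object_year: float = 0.05
    migration_per_object_year: float = 0.10
    emulator_one_time: float = 200_000.0   # E
    emulator_per_year: float = 30_000.0    # yearly part of e(t)


def migration_cost(years: float, objects: int,
                   c: CostAssumptions = CostAssumptions()) -> float:
    """Formula (3): K(t,a) = h(t,a) + m(t,a), with linear h and m (assumed)."""
    storage = c.storage_per_object_year * objects * years
    migration = c.migration_per_object_year * objects * years
    return storage + migration


def emulation_cost(years: float, objects: int,
                   c: CostAssumptions = CostAssumptions()) -> float:
    """Formula (4): K(t,a) = h(t,a) + E + e(t), with linear h and e (assumed)."""
    storage = c.storage_per_object_year * objects * years
    return storage + c.emulator_one_time + c.emulator_per_year * years


if __name__ == "__main__":
    objects, years = 1_000_000, 50
    mig = migration_cost(years, objects)
    emu = emulation_cost(years, objects)
    print(f"migration: ${mig:,.0f}")
    print(f"emulation: ${emu:,.0f}")
    print(f"migration is {100 * (mig / emu - 1):.0f}% more expensive")
```

Under these assumptions the 50-year totals are about $7.5 million for migration and $4.2 million for emulation. Only the migration term scales with the number of objects, which is why the gap widens as the collection grows.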

The difference between emulation and migration is clear, and can be explained by the difference in costs coverage of the two techniques. As the size of the collection directly affects the migration costs, it will be clear that the bigger the archive gets, the higher the migration costs will be. This effect is demonstrated in the second graph. It covers the same period of years, but the size of the collection now is 5,000,000 instead of 1,000,000.

Figure 3: Costs for migration (blue line) and emulation (red dotted line) for maintaining an archive of 5,000,000 digital objects over a period of 50 years. Compared to the graph in Figure 2, the size of the collection is five times as big and the migration costs are now more than twice as high as the costs for emulation.

As shown, the figures vary with the values that are assigned to the variables. The period that it takes before emulation is less expensive than migration is dependent on the following factors:

  • If the number of objects to be preserved is increased, less time will be needed for emulation to be more economical than migration.
  • If the average migration costs per year per digital object are higher, it will take less time before emulation is cheaper than migration.
  • If the costs of developing an emulation device are increased, it will take more time before emulation is less expensive than migration.
  • If the maintenance and development costs for emulation per year are higher, more time will be needed before emulation is more economical than migration.
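
The four factors above can be checked with a small break-even sketch. It assumes the same linear cost model and a constant collection size; storage costs cancel because both strategies are assumed to need the same storage. The function name and its defaults are hypothetical and taken from the assumptions listed earlier.

```python
from typing import Optional


def break_even_years(objects: int,
                     migration_per_object_year: float = 0.10,
                     emulator_one_time: float = 200_000.0,
                     emulator_per_year: float = 30_000.0) -> Optional[float]:
    """Years after which cumulative emulation costs fall below migration costs.

    Solves a*m*t = E + e*t for t. Returns None when the yearly emulation
    upkeep is at least the yearly migration spend, so that emulation never
    catches up under this model.
    """
    yearly_saving = objects * migration_per_object_year - emulator_per_year
    if yearly_saving <= 0:
        return None
    return emulator_one_time / yearly_saving


if __name__ == "__main__":
    for n in (100_000, 1_000_000, 5_000_000):
        t = break_even_years(n)
        label = "never" if t is None else f"{t:.2f} years"
        print(f"{n:>9,} objects: break-even after {label}")
```

With the default values, the break-even point for 1,000,000 objects falls just under three years, and it moves earlier as the collection or the per-object migration cost grows. For small collections the model never favors emulation, consistent with the remark that emulation is inappropriate where its fixed costs cannot be spread over enough objects.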

As said, a number of assumptions were made with respect to the values of the variables. In order to demonstrate the effects of other assumptions, we developed a spreadsheet so as to assess the consequences of different values. The spreadsheet is freely available for test purposes, and researchers and digital archive managers are invited to share their opinions and experiences with respect to cost issues while using the spreadsheet.

5. Conclusion

In this paper we discussed life cycle issues in the context of long-term preservation of digital objects. At the National Library of the Netherlands a fully operational digital archive is in place, and this archive, the e-Depot, provides the context for studies in a number of digital preservation techniques. Both emulation and migration are discussed, as the KB needs to provide access to the original publication as sent by the publisher, while at the same time the KB wants end users to access converted materials according to the most recent standards and functionalities.

Emulation and migration are inherently different in terms of life cycle management, which causes a serious difference in costs. While migration applies to all objects in the collection repetitively, emulation applies to the entire collection as a whole. This makes emulation most cost-effective in cases of large collections, despite the relatively high initial costs for developing an emulation device. When considering the fact that only small fragments of digital archives need to be rendered in the long run, it may turn out that from a financial perspective emulation techniques will be more appropriate for maintaining larger archives.

In this overview, we deliberately neglected a number of important issues. First of all, we did not consider the fact that the migration costs per object may decrease if the number of objects to be converted becomes very large (economies of scale). What is more, we calculated the costs of migration in terms of the number of objects, while it would also make sense to calculate the costs in terms of Gigabytes. The problem here is that preservation costs are connected to the number of objects and it is not clear how many objects are contained in a Gigabyte. This points out that the issue of costs in digital archiving needs more study and should be based on practical experience. The results presented here are a first step for determining life cycle issues in digital archiving, and may serve advanced studies that reach for a complete understanding of cost models in long-term preservation. Participants in this research field are invited to challenge or further develop the cost formulas, preferably by using the attached spreadsheet.

The authors wish to thank Hilde van Wijngaarden (KB) and Raymond J. van Diessen (IBM) for valuable comments on earlier drafts of this paper.

Notes: 

[1] OAIS 2002. Consultative Committee for Space Data Systems, Reference Model for an Open Archival Information System (OAIS), Blue Book, 2002.

[2] Oltmans, E. & Van Wijngaarden, H. (2004). Digital Preservation in Practice: The e-Depot at the Koninklijke Bibliotheek. In: VINE – The Journal of Information and Knowledge Management Systems, Vol. 34 (2004), No. 1.

[3] Steenbakkers, J.F. (2003). Permanent Archiving of Electronic Publications. In: Serials, Vol. 16 (1), March 2003.

[4] Oltmans, E., Van Diessen, R.J. & Van Wijngaarden, H. (2004). Preservation Functionality in a Digital Archive. In: Proceedings of the Joint Conference on Digital Libraries, Tucson, Arizona, June 7-11, 2004.

[5] Lagoze, C., Van de Sompel, H., Nelson, M., Warner, S. (2002). The Open Archives Initiative Protocol for Metadata Harvesting, Version 2.0, 2002.

[6] Jerez, H.N., Liu, X., Hoschtenbach, P., and Van de Sompel, H. (2004). The Multi-faceted Use of the OAI-PMH in the LANL Repository. In: Proceedings of the Joint Conference on Digital Libraries, Tucson, Arizona, June 7-11, 2004.

[7] Caplan, Priscilla (2004). Building a digital preservation archive: Tales from the front. In: VINE – The Journal of Information and Knowledge Management Systems, Vol. 34 (2004), No. 1.

[8] Van Diessen, R.J. & Van der Werf-Davelaar, T. (2002). Authenticity in a Digital Environment. IBM/KB Long-Term Preservation Study Report Series #2, IBM Netherlands, Amsterdam. Available through http://www.kb.nl/e-depot.

[9] Dürr, E., van der Meer, K. (2001). Emulation and conversion – Organisational and architectural overview. Way of working, costs, methods. Available through http://www.library.tudelft.nl/e-archive/Documenten/Resultaten/roquade2.pdf.

[10] Lorie, R. (2002). The UVC, a Method for Preserving Digital Documents: Proof of Concept. IBM/KB Long-Term Preservation Study Report Series #4, IBM Netherlands, Amsterdam. Available through http://www.kb.nl/e-depot.

[11] Van Wijngaarden, H. & Oltmans, E. (2004). Digital Preservation and Permanent Access: The UVC for Images. In: Proceedings of the Imaging Science & Technology Archiving Conference, San Antonio, USA, April 23rd 2004.

[12] Shenton, Helen (2003). Life Cycle Collection Management. In: Liber Quarterly – The Journal of European Research Libraries, Vol. 13 (3/4).

[13] Fox, Peter (2002). Archiving of electronic publications – Some thoughts on cost. In: Learned Publishing, Vol. 15, No. 1, January 2002.


Copyright 2004 RLG.