Kevin Ashley, University of London Computing Centre
Introduction
The security and authenticity of digital materials are closely related topics. In almost any endeavour involving preserved digital materials, there will be some degree of concern about the identity of the reader and the provider as well as of the preserved object itself, and there may be complex relationships between these three identities and what is permitted and not permitted. I will dwell today primarily on the issue of object authenticity and the ability to draw parallels with the world of traditional preserved materials. I will also talk about the possibility of presenting variant forms of objects, in which variable quantities of information are released, in domains where information is sensitive and controlled access applies.
However, it may be useful to summarise the main issues surrounding each of these areas of authentication and identification. For those of you steeped in these issues (and many of you probably understand them in more depth than I do), I apologise for going over old ground. If you are reading this paper, feel free to skip to the next section. If you're listening to me, take a 5-minute nap.
Throughout this presentation, I will be using certain terms and making certain assumptions which I'll set out here.
First, I am presuming that preservation of materials is always tied up in some way with access to them. This link isn't always direct, it isn't always done by the same people and it doesn't always happen at the same time, but if we are preserving things it must be because we believe, at some point in time, that someone will be provided with access to them.
Second, I'm going to be referring to three entities:
- Content providers - people who initially generate the material we're preserving, and have some sanction over whether we preserve it, who we give it to, etc. They may be publishers who expect a revenue from access to our holdings, or they may be individuals or institutions depositing material with an archive with no expectation of future revenue. They may be the same people as the service providers or the users, or they may not be.
- Service providers - the people who are doing the actual preservation and/or providing the access. I'm viewing all of the issues today from their point of view, since that is what this conference is about.
- Users - the people who want to gain access to our preserved materials, or to the metadata about them. They may not actually be "people" in the sense of human beings, but may instead be other services, intelligent agents, other repositories etc. Sometimes our users are also our content providers, or overlap with them. Sometimes they are the same organisation as the service providers. Sometimes we don't even know who they are, and we don't care. But sometimes we care a lot.
Authentication and identification
Users - "I'm me"
In any type of digital preservation activity, we will have some need to identify users or clients. This will apply whatever it is that our "users" are doing or whether they are human beings or not. Even if our collection is not open to any form of access, we still have to contend with access by our collection administrators, who must still provide some means of proof of their identity. In fact, we can argue that it is critical that administrative access is subject to strong proofs of identity in order to allow us to construct audit trails relating to the addition of material and metadata, and any other activity that the administrators may be permitted to carry out. Without this form of authentication, we may be in a difficult position if we have to demonstrate the authenticity of the preserved materials.
But in most cases, the digital archive, library or museum is concerned with external users and identifying them in some fashion. There are a number of different reasons for wanting to do so, and different levels of assurance we might want for each. At the simplest level, we may have material with no restrictions on access, and no charging involved. We probably still want to count the number of different users we have, if only to justify to someone why we're preserving this material or going to the effort of making it available. We can meet this requirement with simple self-registration and basic HTTP authorisation. It doesn't offer much in the way of security, but we don't need that just to count people. In particular, our users don't really care if an identification system such as this is capable of being broken, since they don't really have anything to lose.
We may, however, be dealing with resources where some form of charging applies, either at the time of use or in advance. In the latter case, there is an implication that there is a predefined set of users entitled to access the resources associated with a particular payment. We need something with a little more security in an environment such as this. Typically, our users now have a vested interest in ensuring that people cannot masquerade as them or steal their authorisation, since this is equivalent to stealing their money. If content providers are being paid, they will also wish to ensure that our system does not allow people to masquerade as someone else, although exactly who cares most about the security depends to a great extent on the charging model. Up-front payments which entitle you to unlimited access mean there is a positive incentive for users for the system to be somewhat insecure (one payment can then be shared), but a strong incentive for content providers and service providers that it is secure. Per-use charges put the incentive firmly back in the user's lap. As a service provider, I don't really care in this case who is accessing the stuff as long as I get my money (that would be the cynical view, at least).
But we may also need to restrict access, or prove identity, for reasons other than fiscal ones. If we're storing sensitive material of some form, it has probably been deposited or acquired with an understanding that access to it is restricted either to a known group, to individually authorised users, or to no one at all for some period of time. In this case both the content provider and the service provider have a strong interest in ensuring that users are who they claim to be, and that masquerading is very difficult to do. The users may or may not care; by imposing a duty of care on the users to protect whatever secret information we provide to them (such as a password or key) to identify themselves, we can make them care more than they might do otherwise.
Service providers - "you're you"
Sometimes it's also important for users to be confident that they are dealing with who they think they are. This is obviously the case for transactions that involve disclosing personal information or financial information such as credit card numbers. But it can also have implications when we're looking at any preserved material whose provenance may be suspect or whose accuracy or even very existence is a matter of dispute.
Encrypted communications, digital signatures and hierarchies of signature validation such as are used to sign web security certificates go some way towards meeting this problem, but they clearly only solve part of it. If I am dealing with a national institution such as the Public Record Office or the British Library, then the certificate itself is enough to enable me to establish my degree of confidence in the material I am looking at. My knowledge of these institutions and the degree of trust I place in them is something that is independent of any technological system. The technology has merely assured me that I'm dealing with who I think I am. (If the certificate is signed by "Honest Ron's Validation Service" I may want to look somewhat closer at the chain of trust, however.)
Technology does not, of itself, enable me to place my trust in someone I've never dealt with before. The mechanism of security certificates does not vouch for the accuracy of what an organisation does, nor even for its financial probity. It merely tells me that the people who made my web browser trust someone who trusts someone who trusts someone who checked that these people are who they say they are.
Object Authenticity - "is that that?"
Assuming that a user has identified themselves to the satisfaction of the service provider and the service provider has satisfied the user as to their own identity, they can now begin to enter into a dialogue regarding preserved digital materials and possible access to them. The user may now be concerned with the authenticity of the objects themselves. What can we do to reassure them of this? In archives and museums, and the manuscript collections of libraries, the principal means of attesting this is via a mixture of provenance and trust. Provenance attempts to document an unbroken chain of custody through a succession of mutually-trusting entities, and that chain of trust is then passed through the service provider to the user.
In fact, the chain of trust need not be unbroken for the material still to have value. It may well be that an object has a murky past, but is interesting nonetheless because of its later, and more well-documented history. An example of this might be the Zinoviev letter, now part of the National Archive at the PRO. If we ask to see this object at the PRO, we can have confidence that it is the Zinoviev letter and in that respect we believe it to be authentic - but on another level, its authenticity is in extreme doubt. Its provenance is most certainly shrouded in some degree of mystery.
Digital techniques allow us to attach a much greater degree of confidence to the accuracy of digital objects. Digital signatures allow us to attest with quantifiable certainty whether or not an object has been altered since it was deposited. Most of those doing preservation will be making active use of such signatures in the monitoring they undertake of the material they preserve.
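The check described above can be sketched in a few lines of Python. This is a minimal illustration, not a description of any particular repository's practice: it records a SHA-256 digest at deposit and recomputes it later. The function names and the sample object are hypothetical. A digest on its own only detects alteration; a full digital signature would additionally sign that digest with the depositor's or repository's private key.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 digest of an object's bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def is_unaltered(data: bytes, recorded_digest: str) -> bool:
    """Compare the object's current digest with the one recorded at deposit."""
    return fingerprint(data) == recorded_digest


# Hypothetical object, and the digest recorded when it was deposited.
deposited = b"Minutes of the committee, 12 March"
recorded = fingerprint(deposited)

print(is_unaltered(deposited, recorded))         # check an unchanged copy
print(is_unaltered(deposited + b"!", recorded))  # check an altered copy
```

The recorded digest has to be kept somewhere an intruder cannot rewrite it along with the object, which is one reason for lodging it with a third party.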
Parallels with the traditional world
It seems to me that we are sometimes struggling in the world of digital preservation to achieve unattainable perfection in our methods. People will ask whether I can guarantee that material will never be lost, that it will never be corrupted, that it can always be read. I cannot, of course, give any such guarantee. What I can do is give quantifiable answers as to the likelihood of loss or corruption given a particular means of storage, and also quantifiable costs for decreasing that likelihood by any given amount. Given relatively accurate figures for media lifetimes and the patterns of early failure, and for the error detection and correction mechanisms used within particular subsystems, this is a relatively straightforward task. What is more, the figures we end up with are usually much better than those in the world of traditional materials. Even more usefully, we actually do have figures. Ask the average library or archive what their expected loss or damage rate in pages per year is and they probably cannot tell you.
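A rough sketch of such an estimate follows. It rests on simplifying assumptions that are not taken from any real system: copies fail independently, each with the same annual probability, and any lost copy is replaced from a survivor at a yearly audit. The failure rate and cost figures are purely illustrative.

```python
def annual_loss_probability(p_copy: float, copies: int) -> float:
    """Chance that every copy is lost within the same year.

    Assumes copies fail independently, each with annual probability p_copy.
    """
    return p_copy ** copies


def loss_probability_over(years: int, p_copy: float, copies: int) -> float:
    """Chance of permanent loss at some point during the given number of years.

    Assumes lost copies are replaced from a surviving copy at each yearly
    audit, so permanent loss requires all copies to fail within one year.
    """
    survive_one_year = 1.0 - annual_loss_probability(p_copy, copies)
    return 1.0 - survive_one_year ** years


# Illustrative figures only: a 1% annual failure rate per copy and a
# hypothetical cost of 100 units per copy per year.
P_COPY = 0.01
COST_PER_COPY = 100

for copies in (1, 2, 3):
    risk = loss_probability_over(50, P_COPY, copies)
    print(f"{copies} copies: cost {copies * COST_PER_COPY}/year, "
          f"50-year loss probability {risk:.6f}")
```

Under these assumptions each extra copy multiplies the annual risk by the per-copy failure rate while adding a fixed cost, which is the kind of trade-off that can be put in front of a funder.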
This point illustrates the more general problem we face with issues of authenticity and authorisation. We are too often being asked to meet impossibly high standards, or certainly to meet standards that far exceed those being applied to traditional repositories. In most cases, only the most basic of checks is made on the identity of users, and there is very little ability to do accurate checks on the completeness and accuracy of the holdings or their catalogues. Balanced against this is the fact that, particularly where unique materials are involved, there is often close supervision of the actual process of access, and little opportunity to damage what is provided.
But a number of cases, at institutions such as the PRO and the Victoria and Albert Museum, demonstrate that even the best defences do not protect against damage to objects or their metadata. Because those objects were not digital, and therefore it was not possible to have exact copies stored elsewhere to guard against theft or damage, the loss is permanent. If even these august institutions cannot protect themselves, what hope for the smaller local archive or museum? By contrast, even if appalling security at our site were to let hackers get in and wreak havoc with our stored data, the fact that we can have identical copies and signatures of all material stored with trusted third parties means we need never suffer permanent loss. Even if we do not detect the intrusion at the time, a later periodic comparison with the off-site store will indicate that something is awry. In the case of the damaged catalogues at the V&A, we do not know to this day which entries are spurious, since no copy of the catalogues existed in any form whatsoever.
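The periodic comparison with an off-site store might look something like the following sketch. It assumes that the trusted third party holds a manifest of SHA-256 digests made at deposit. The function names, object names and contents are all hypothetical.

```python
import hashlib
from typing import Dict, List


def build_manifest(objects: Dict[str, bytes]) -> Dict[str, str]:
    """Map each object name to the SHA-256 digest of its contents."""
    return {name: hashlib.sha256(data).hexdigest()
            for name, data in objects.items()}


def compare_with_offsite(local: Dict[str, str],
                         offsite: Dict[str, str]) -> Dict[str, List[str]]:
    """Report objects that differ from, or are absent from, the off-site record."""
    return {
        "altered": sorted(n for n in offsite
                          if n in local and local[n] != offsite[n]),
        "missing": sorted(n for n in offsite if n not in local),
        "unexpected": sorted(n for n in local if n not in offsite),
    }


# Hypothetical holdings as deposited; the manifest is lodged off site.
original = {"a.txt": b"alpha", "b.txt": b"beta", "c.txt": b"gamma"}
offsite_manifest = build_manifest(original)

# The same holdings after an undetected intrusion at the local site.
tampered = {"a.txt": b"alpha", "b.txt": b"BETA", "d.txt": b"delta"}
report = compare_with_offsite(build_manifest(tampered), offsite_manifest)
print(report)
```

Anything reported as altered or missing can then be restored from the off-site copy, and anything unexpected can be investigated as a possible spurious entry.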
In the case of attesting authenticity, we can certainly provide more convenient mechanisms for providing authenticated copies of material to users. The Public Records Acts, for instance, provide for the concept of authenticated copies of records, which are considered to be sufficient to be adduced as evidence in court in place of the originals. It's not cheap to buy them, and the process effectively provides single-shot authentication. After the copies are produced and checked, they are wrapped and sealed and a certificate of authenticity is attached to the package and its seal. This is sufficient to show that the documents are authentic copies of public records only until the point where the seal is broken. By contrast, digital materials can be placed in a package which can be 'opened' time and time again, by the simple technique of using public-key cryptography and encrypting the objects with the repository's private key.
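The re-openable seal can be illustrated with a deliberately toy version of RSA. Everything here is an assumption made for illustration: the key is built from tiny textbook primes and would offer no security at all, the function names are hypothetical, and the seal is applied to a digest of the object rather than to the object itself, which is how signing is usually done in practice. A real service would use a vetted cryptographic library and keys thousands of bits long.

```python
import hashlib

# Toy RSA key pair from tiny textbook primes. For illustration only.
P, Q, E = 61, 53, 17
N = P * Q                          # public modulus, published with E
D = pow(E, -1, (P - 1) * (Q - 1))  # private exponent (needs Python 3.8+)


def digest_as_int(data: bytes) -> int:
    """Reduce the SHA-256 digest of the data to a number below N."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % N


def seal(data: bytes) -> int:
    """Repository side: transform the digest with the private exponent."""
    return pow(digest_as_int(data), D, N)


def check_seal(data: bytes, seal_value: int) -> bool:
    """User side: undo the transform with the public exponent and compare."""
    return pow(seal_value, E, N) == digest_as_int(data)


record = b"Hypothetical public record, piece 1"
record_seal = seal(record)

# Unlike a wax seal, this one can be checked any number of times.
for _ in range(3):
    print(check_seal(record, record_seal))

# A seal that has been interfered with no longer matches the record.
print(check_seal(record, (record_seal + 1) % N))
```

Checking the seal needs only the public values, so any user can repeat the check without the repository being involved and without destroying the evidence.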
We aren't doing as well with means of identifying our users. Most of the techniques in common use today are not much better than reader's cards and just as open to abuse. Anything based on usernames and passwords can be abused (with the cooperation of both parties) just as easily as a reader's card can be loaned to a non-user. We can improve this slightly by tying the use of IDs to things such as network addresses or even times of day, but these restrictions are often unacceptable and still don't provide very firm guarantees. Whether any of this really matters comes back to the issue of knowing why we are trying to identify our users in the first place.
Selective disclosure
This brings me to the issue of selective disclosure, techniques which we've had to adopt for providing access to public records at NDAD. The effective deployment of this technique depends, in the end, on us being sure about the identity of a particular user. As I'll explain in my presentation, we don't have a satisfactory solution to that particular problem yet. However, the rest of our technique is still just as useful.
By selective disclosure, I mean that we sometimes have to deal with materials where not everything in them can be made available to readers straight away. As well as having records whose entire contents are closed for 30, 50 or even 100 years, we also have other material where only parts of it are closed in this way, with other parts being available straight away, or at least more rapidly than the most sensitive sections. This may be because a database contains information which is subject to data protection, or because it contains commercially sensitive information.
This process has been used for many years in record offices, where it is known, somewhat imprecisely, as redaction. You will probably all have seen examples of records released from the security services in this country and elsewhere where key names, places and other pieces of incriminating or sensitive information are blacked out. This process is, of course, tedious and expensive. It only tends to be carried out for material of great interest, or in those cases where the partial release of information is forced onto government by court action. Moreover, it requires the creation of multiple record copies, one to be released now with certain items removed, and the true unaltered copy to be stored awaiting an era when it can be released without incident.
Digital objects can have this process applied to them far more easily, and as a result we have been able to perform such selective opening of far more material than would be the case were the records on paper at the PRO. In my presentation, I'll describe in more detail exactly how we do this for documents and database records. I'll also illustrate that we can do this without needing to create multiple copies of anything. Indeed, this is essential, since different sets of people may be given access to arbitrary subsets of a particular set of records. The general public may see one thing, an accredited researcher may be allowed to see a little more, another government department even more and the depositing department will have the right of access to all the material.
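One way such layered access could be arranged, without multiple copies, is sketched below. This is not a description of the mechanism used at NDAD. It is a minimal illustration that assumes each field of a single stored record carries the lowest access level allowed to see it. The level names, field names and values are all hypothetical.

```python
from enum import IntEnum


class Access(IntEnum):
    """Hypothetical access levels, from least to most privileged."""
    PUBLIC = 0
    RESEARCHER = 1
    GOVERNMENT = 2
    DEPOSITOR = 3


# One stored copy of the record. Each field is held with the lowest
# access level permitted to see it.
RECORD = {
    "case_id": ("1987/042", Access.PUBLIC),
    "region": ("South East", Access.PUBLIC),
    "employer": ("Acme Ltd", Access.RESEARCHER),
    "turnover": (1250000, Access.GOVERNMENT),
    "claimant_name": ("J. Smith", Access.DEPOSITOR),
}


def view(record: dict, level: Access) -> dict:
    """Build the view a user at the given level may see, leaving the store untouched."""
    return {field: value
            for field, (value, required) in record.items()
            if level >= required}


for level in Access:
    print(level.name, view(RECORD, level))
```

Each view is computed at the time of access, so changing who may see a field means changing one label rather than producing a fresh redacted copy. A strictly ordered set of levels is itself a simplification; arbitrary subsets would need a per-field list of permitted groups.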
Such controls need to be applied to metadata about preserved objects as well as to the objects themselves. Such access controls on metadata do not seem to feature strongly in any of the schemes I have seen developed so far for metadata. They are, however, essential for many uses. Apart from the fact that we will often wish to close a whole set of metadata fields which we would consider internal and administrative, there are other instances where material which would be revealed for most objects cannot be revealed for some. In some cases, we cannot admit to the existence of objects at all except in the vaguest of ways. Records relating to a trial involving incest, for example, cannot reveal in their metadata anything about the name of any party to the trial, as to do so would be to breach the anonymity of those wronged by the offence. We thus may have a need for multiple titles: the full title (which is probably closed for as long as the records themselves) and the title which will be used in the interim.
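The idea of multiple titles can be sketched as a small data structure. This is an assumption about how such metadata might be modelled, not a feature of any existing metadata scheme. The class, the field names, the example titles and the closure date are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Title:
    """A catalogue title with a closed full form and an interim public form."""
    interim: str
    full: str
    full_open_from: date  # date on which the full title may be released


def title_for(title: Title, today: date, privileged: bool = False) -> str:
    """Return the full title only to privileged users or once it has opened."""
    if privileged or today >= title.full_open_from:
        return title.full
    return title.interim


# Hypothetical catalogue entry.
entry = Title(
    interim="Crown Court case file (parties withheld)",
    full="Crown Court case file: R v. Example",
    full_open_from=date(2075, 1, 1),
)

print(title_for(entry, date(2001, 6, 1)))
print(title_for(entry, date(2001, 6, 1), privileged=True))
print(title_for(entry, date(2075, 1, 1)))
```

The same pattern extends to any other metadata field whose open and closed forms differ.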