December 15, 2003, Volume 7, Number 6 | ISSN 1093-5371
PDF/A: Developing a File Format for Long-Term Preservation

A committee of government, business, and academic representatives is exploring a promising approach to support long-term preservation of text-based digital documents. Originally sponsored in 2002 by the Association for Information and Image Management (AIIM) and the Association for Suppliers of Printing, Publishing and Converting Technologies (NPES), the committee prepared a draft preservation standard for Adobe’s Portable Document Format (PDF). Known as PDF/A, the potential standard intends to specify the use of PDF in a manner that is specifically geared to long-term management and use. PDF/A would be ideally suited for documents whose content and appearance must remain stable over long periods of time. A newly formed Joint Working Group (JWG) of the International Organization for Standardization (ISO) has accepted the draft as the basis for further development as an approved standard.[2]
One potential solution is to rely on text with a markup language such as the Extensible Markup Language (XML) to preserve documents. This offers some important advantages, chiefly that textual content will achieve a degree of independence from specific information technology configurations. But use of XML does not always ensure reproduction of the original visual appearance of documents. This is a particularly significant issue in situations where a document in another format is migrated to XML. Textual content may be reasonably well represented in the XML version, but much of the original document’s formatting and layout will likely be lost. If a word processing document is moved to XML, for example, instructions relating to line and page breaks, font characteristics, footnote placement, margin width, and other format-specific elements will either not migrate at all or require support for complex (and often proprietary) style conventions. Encodings used for metadata, versioning, and other identification and tracking features may also not migrate. A further problem is that XML is sharply limited in its support for nontextual data such as photographs and other graphics.

Some communities have already made the determination that, for reasons of authenticity and trustworthiness, it is necessary to retain both the content and physical appearance of digital documents. Other communities are interested in facilitating long-term preservation by relying on one file format throughout the digital life cycle. Almost everyone is interested in a format that enables robust metadata.[3] For text-based objects, these interests can be summarized as a set of three basic requirements:
PDF addresses most of these requirements. It is widely integrated into many document producer work environments. Users are quite familiar with PDF from its ubiquitous presence on the World Wide Web. Some cultural heritage institutions favor PDF because it is based on a published specification; this permits independent development of non-proprietary tools for rendering documents. By publishing the specification, Adobe has managed to avoid a key preservation problem with most other commercial software: barriers (technical as well as legal) for users to decode information content contained in files. In addition, the format retains the appearance and other features of digital documents that may constitute significant properties (such as layout, formatting, and “look and feel”). The most recent PDF version also offers a rich metadata capability known as the Extensible Metadata Platform (XMP), which is based on the XML and Resource Description Framework (RDF) specifications of the World Wide Web Consortium (W3C).

Despite its advantages, unrestricted PDF is not suitable as an archival format. Adobe controls its development and is under no obligation to continue publishing the specification for future versions. The format includes some features that are incompatible with preservation purposes. PDF documents, for example, are not required to be self-contained; certain fonts may be drawn from outside the file.[4]

The work of the ISO PDF/A JWG is to define the set of PDF components that may be used and restrictions on their use to support long-term preservation. For example, the draft ISO PDF/A standard is distinct from PDF in that:
As currently written, the draft ISO PDF/A standard intends to specify a format for representing documents created natively in PDF, converted from other digital formats, or digitized from paper or microfilm. The standard will also support development of products that read, render, write, and validate conforming PDF objects. Sections currently are provided for file format (such as file header/trailer, string and stream objects, and other base elements that form the general file structure); graphics; fonts; annotations; actions (including treatment of hyperlinks); metadata; logical structure; and forms. The metadata section relies on XMP, which provides for broad and flexible document characterization. From an archival perspective, XMP shows much promise for purposes of description, provenance (e.g., history of the document and its context), preservation, and administration. There are also some key technical advantages for digital preservation because XMP metadata is 1) embedded in each file as plain text, which both lessens the possibility of loss and simplifies access, and 2) structured and represented in a manner that conforms to W3C specifications. As with XML and RDF, XMP permits user-defined schemas to describe metadata properties. This offers the prospect of rich metadata that is widely interoperable and interpretable over time. Currently XMP does not provide for machine-readable schemas, which severely limits validation of metadata against applicable schemas. A major problem here is the pending status of the RDF schema specification. W3C is, however, making progress toward formal approval of the specification. If approved as an ISO standard, PDF/A could have an important role in digital preservation. The format promises to be widely suitable for creating and distributing documents, recording evidence of transactions, searching and retrieving, and many other common uses. 
This means repositories will be able to ingest and manage documents in their original format, which is important from both a cost and authenticity standpoint. PDF/A will also support migration of documents from other formats for long-term retention.

Notes
[1] The author is reporting from the perspective of a member of the original AIIM/NPES PDF/A committee and the U.S. Technical Advisory Group to the ISO PDF/A JWG and is not representing any official position of the U.S. Library of Congress.
[2] Background and other details associated with the AIIM/NPES committee and its association with ISO are available; information about formal ISO status is available.
[3] A sampling of these and other community interests is discussed in “E-Documents Need E-Preservation,” Washington Technology, 3/3/2003.
[4] For a more extensive discussion of potential preservation issues associated with unrestricted PDF, see “Archiving and Preserving PDF Files,” RLG DigiNews, 2/15/01.

Research Agendas Set Course for Digital Archiving and Long-Term Preservation
Margaret Hedstrom

Both reports stress the growing centrality of digital information in government, commerce, research and education, cultural heritage, and even interpersonal communications, as well as the inadequacy of current digital preservation strategies and methods to address challenges posed by increasingly complex digital entities. As the title suggests, It’s About Time identifies concern with “the long term” as one characteristic that distinguishes digital preservation research from research on digital libraries or storage technologies. The long-term perspective raises issues of technological obsolescence and evolving technologies, social and managerial concerns over the threat of interruptions in management of digital archives, and economic questions around the business and funding models needed to sustain digital archives over many generations. The report outlines priority areas for research intended to develop partnerships between academic researchers, researchers in the information technology sector, and program managers in government agencies who have responsibilities for managing and preserving large data collections.
These two reports offer plenty of suggestions for projects that will keep digital preservation researchers and curators of digital collections occupied for several years. There are already indications that some of the sponsors are soliciting proposals in this area. The recent call for proposals from the NSF Digital Government Program includes digital archiving as one key component. The European Commission will likely include digital archiving projects in its Information Society Sixth Framework with the possibility of more international projects.

Notes
[1] It’s About Time: Research Challenges in Digital Archiving and Long-Term Preservation, Final Report, Workshop on Research Challenges in Digital Archiving and Long-Term Preservation, April 12-13, 2002, sponsored by the National Science Foundation, Digital Government Program and Digital Libraries Program, Directorate for Computing and Information Sciences and Engineering, and the Library of Congress, National Digital Information Infrastructure and Preservation Program, August 2003.
[2] Invest to Save: Report and Recommendations of the NSF-DELOS Working Group on Digital Archiving and Preservation, prepared for the National Science Foundation’s (NSF) Digital Library Initiative and the European Union under the Fifth Framework Programme by the Network of Excellence for Digital Libraries (DELOS), 2003.
[3] NSF and DELOS sponsored seven additional joint working groups on topics of interest to the digital library community. All the working groups’ reports are available.
FAQ

What impact will the Librarian of Congress's recent rulemaking on the Digital Millennium Copyright Act's anticircumvention provisions have on the ability of libraries and archives to preserve access-controlled digital information?

This FAQ is answered by Peter Hirtle. Hirtle is the Director of Instruction and Learning at Cornell University Library and also serves as the Library's Intellectual Property Officer. He is the immediate past president of the Society of American Archivists.

Before we can answer this question, it is important to understand the background, rationale, and scope of the Librarian's rulemaking. The Digital Millennium Copyright Act (DMCA, enacted in 1998) gave copyright owners important new protections. One of them made it illegal for users to bypass any technological mechanisms that the copyright owner may have placed on works to control access to them, and imposed harsh civil and criminal penalties for knowingly circumventing the controls. Passwords are one form of access control; encryption (or scrambling) of a file is another. The prohibition applies even if the intended use is otherwise lawful and noninfringing.

Recognizing that this provision might unduly affect the rights of users, Congress directed that every three years the Librarian of Congress should determine whether the implementation of access control measures is diminishing the ability of individuals to use copyrighted works in ways that are otherwise lawful. The focus of the rulemaking is on whether there are specific classes of copyrighted works the use of which is, or in the next three years is likely to be, adversely affected by the prohibition against bypassing access control mechanisms.
On October 28, 2003, the Librarian identified, from the numerous suggestions submitted by the public, four classes of works that will be exempt for the next three years (until the next round of rulemaking) from the DMCA's prohibition against the circumvention of technology that controls access to a copyrighted work. The third exemption addresses a preservation use:
The third class of exempted works modifies an exemption proposed by the Internet Archive intended to address the problem of preserving software. Some computer programs and video games will only operate in the presence of specific media or hardware. In some cases, the original media on which the program was distributed must be inserted into the appropriate drive in the computer for the software to operate. In other cases, a specific piece of hardware such as a dongle (a hardware lock that attaches to a computer and interacts with software programs to prevent unauthorized access to that software) must be present.
The Librarian's DMCA exemption allows libraries that wish to preserve such software to legally bypass access control mechanisms in the software. An important requirement is that the works must be in formats that are now obsolete. A format is considered obsolete if the machine or system necessary to render perceptible a work stored in that format is no longer manufactured or is no longer reasonably available in the commercial marketplace for new equipment. It seems likely that a computer program or video game that was distributed on an 8-inch floppy disk would now be considered obsolete. The situation is much less clear with 5 1/4-inch floppy disks. The Register of Copyrights in her report to the Librarian of Congress left the issue open:
While it is important to understand what the Librarian's ruling permits, it is even more important to understand the limitations of the ruling. Even the Register recognized that many of the important concerns that librarians and archivists have about the preservation of our digital heritage would not be satisfied by the scope of this exemption.[3] First, the Librarian rejected the Internet Archive's recommendation that literary and audiovisual works in addition to computer programs and video games be included in the class of exempted works. The Register did not find conclusive evidence in the submitted comments that access control mechanisms that rely upon the original hardware or software are a significant problem for e-books, sound recordings, or other digital works. The exemption is limited solely to computer programs, defined in the Copyright Act as "a set of statements or instructions to be used directly or indirectly in a computer in order to bring about a certain result," and video games (which are undefined in the Act and recommendation). It cannot be used to preserve an e-book designed to be played only on a particular reader. Second, the exemption is limited to access control mechanisms that require the original media or hardware to operate. Other access control mechanisms, such as passwords, are not covered by the exemption. The regulation would not authorize, for example, circumventing password protection in a word processing or PDF document that might limit whether people can edit, print, or copy the document. The requirement that limits the exemption to obsolete hardware or media is also quite strict. Preservationists know that by the time a digital work's software or hardware environment is obsolete, it may be too late to preserve it. 
Nevertheless, the Register concluded that Section 108(c) does not authorize the categorical preservation of any works other than obsolete works; preemptive archival activity is expressly excluded.[4] Section 108(c) does also authorize the reproduction of "deteriorating" works, but the Register concluded that this factual determination must be made on a case-by-case basis. One cannot simply conclude that all works are deteriorating from the moment of creation. Preservation of digital works before they become obsolete may be permissible under Section 117 (Computer Programs) or Section 107 (Fair Use) of the Copyright Act. Section 117 authorizes the making of a reproduction of computer programs when "such copy or adaptation is for archival purposes only." For the purposes of the rulemaking, however, the Register followed some court opinions that have interpreted the language narrowly, redefining archival copies as backup copies only.
Lastly, the Librarian's rulemaking does not affect the prohibition found in 1201(a)(2) against the manufacturing or distribution of devices that can circumvent access control mechanisms.[6] It may be legal for a library or archives to create devices to help them circumvent an access control mechanism during the next three years, but it is illegal for that library or archives to purchase such a device from others. Nor can they share a solution with other cultural institutions. Few repositories are in a position to reverse engineer obsolete access control mechanisms. In many

The DMCA exemptions are a chimera as far as preservation is concerned. Given the narrow scope in which the Librarian of Congress feels he must operate, the focus on classes of works rather than uses, the limited number of exempted classes of works, and the continuing ban against the tools that could bypass access control mechanisms, there is little that the exemptions can do to help librarians and archivists.

There are steps, however, that the library and archival community can take to begin to address the gaps in the Librarian's rulemaking. First, we should begin to prepare now for the next round of rulemaking in 2006. Each round of rulemaking begins anew. If we want to preserve even the limited rights granted this time, it will be necessary to resubmit and re-establish the need for the existing exemptions. Furthermore, we can begin now to identify additional classes of works in which access-control mechanisms have limited our rights (especially our ability to exploit Section 108 rights). It is important that the library and archival communities identify concrete examples of when access control mechanisms interfere with preservation and report those examples to groups such as the Society of American Archivists who are interested in the preservation of electronic information. Many proposed exemptions in this round of rulemaking were rejected precisely for lack of such concrete examples.

Librarians and archivists should also continue to challenge the Librarian's narrow interpretation of his authority in the rulemaking proceeding. The Assistant Secretary for Communications and Information of the Department of Commerce, who is required by law to advise on the rulemaking proceeding, has noted that "in some circumstances, the intended use of the work or the attributes of the user are critical to a determination whether to allow circumvention of a technological access control."
The law itself requires that the Librarian when conducting the rulemaking examine "the availability for use of works for nonprofit archival, preservation, and educational purposes."[7] The Librarian's continued narrow definition of what is meant by "classes of works" subverts the clear intention of Congress and the public interest in favor of the narrow interests of the intellectual property monopolies.
Notes
[1] For more on the legal bases for digital preservation, see Peter B. Hirtle, "Digital Preservation and Copyright".
[2] Memorandum, Marybeth Peters to James Billington, Recommendation of the Register of Copyrights in RM 2002-4; Rulemaking on Exemptions from Prohibition on Circumvention of Copyright Protection Systems for Access Control Technologies, 27 October 2003, p. 50.
[3] Ibid., p. 63.
[4] Ibid.
[5] 17 U.S.C. § 108(f)(4).
[6] Similarly, while it may sometimes be legal for individuals and repositories to make reproductions of digital files that contain controls against copying, it is illegal to distribute hardware and software that would allow you to do this. 17 U.S.C. § 1201(b)(1).
[7] 17 U.S.C. § 1201(a)(1)(C).
[8] Register's Recommendation, p. 63.
Calendar of Events

Three-Day Program on Digital Libraries
Maryland Institute for Technology in the Humanities (MITH) TEI XML/XSLT Winter School
"Implementing the benefits of OAI," 3rd Workshop on the Open Archives Initiative (OAI3)
Call for Proposals: The Illinois Online Conference (IOC)
ECURE 2004: Preservation and Access for Electronic College and University Records
2004 International Conference on Digital Archive Technologies (ICDAT2004)
Libraries in the Digital Age (LIDA)
The International Association for Social Science Information Service and Technology (IASSIST) Conference

CLIR and NIST Publish Guide to Care and Handling of CDs and DVDs
Consortium of Heritage Groups Unveils New Online Resource
NDIIPP Publishes Findings on Research Challenges in Digital Preservation
Launch of AGORA (Access to Global Online Research in Agriculture)
OAI-Rights Effort Launched
The Xtensible Past
Report Evaluates Audiences for Digitized Cultural Content
PREMIS Announces Survey on Preservation Metadata Implementation
UNESCO Adopts a Convention on the Preservation of Intangible Heritage
New Publication on Digital Preservation

RLG News

RLG Forums - To Have and To Hold: Metadata and Institutional Repositories

Speakers in the morning session covered a wide range of metadata standards used to describe, reveal, and deliver electronic information resources. Updates on metadata standards and standards activities were detailed. The presentations, topically grouped into description and discovery, management and delivery, and "mixing and matching," set the stage for the afternoon discussion of digital repositories and how the repositories discussed use metadata.

Afternoon speakers discussed options available for digital repositories to manage and store digital information. Speakers discussed their implementations of Open Source software such as DSpace and FEDORA, as well as other local, institutional repository infrastructure. Especially helpful to attendees was information about the choices other RLG members made regarding the preservation of digital materials at their institutions, how well their choices have met their needs, and what lessons have been learned.

Copies of the presentations are available on the RLG web site at http://www.rlg.org/events/haveandhold2003/.

Publishing Information

RLG DigiNews (ISSN 1093-5371) is a Web-based newsletter conceived by the RLG preservation community and developed to serve a broad readership around the world. It is produced by staff in the Department of Research, Cornell University Library, in consultation with RLG and is published six times a year at www.rlg.org.

Materials in RLG DigiNews are subject to copyright and other proprietary rights. Permission is hereby given to use material found here for research purposes or private study. When citing RLG DigiNews, include the article title and author referenced plus "RLG DigiNews." Any uses other than for research or private study require written permission from RLG and/or the author of the article. To receive this, and prior to using RLG DigiNews contents in any presentations or materials you share with others, please contact Jennifer Hartzell, RLG Corporate Communications.

Please send comments and questions about this or other issues to the RLG DigiNews editors.

Co-Editors: Anne R. Kenney and Nancy Y. McGovern; Associate Editor: Robin Dale (RLG); Technical Researcher: Richard Entlich; Contributor: Erica Olsen; Copy Editor: Martha Crowe; Production Coordinator: Carla DeMello; Assistant: Valerie Jacoski.

All links in this issue were confirmed accurate as of December 15, 2003.