Contents of: Volume 8, Number 2 ISSN 1093-5371
  Feature Article 1: Treasuring the Digital Records of Science: Archiving E-Journals at the Koninklijke Bibliotheek  
  Feature Article 2: Computational Linguistics Meets Metadata, or the Automatic Extraction of Key Words from Full Text Content  
  Feature Article 3: Character Sets and Character Encoding: A Brief Introduction  
  ECURE Summary: ECURE Presents Diverse Views of Preservation and Access  
  Highlighted Web Site: LEADERS: Linking EAD to Electronically Retrievable Sources  
  Calendar of Events  
  Announcements  
  RLG News: New Look and Functionality for RLG DigiNews; Upcoming RLG Forum  
  Publishing Information  
Feature Article 1

Treasuring the Digital Records of Science: Archiving E-Journals at the Koninklijke Bibliotheek

Author: Johan F. Steenbakkers - Koninklijke Bibliotheek, National Library of the Netherlands (johan.steenbakkers@kb.nl)

 

Overview

The Koninklijke Bibliotheek (KB) is the national and depository library of the Netherlands. The depository collection relies on voluntary arrangements with the Dutch publishers, not legislative mandate. In 1994, the KB decided to include e-publications with Dutch imprint in its deposit collection and renewed arrangements with the Dutch publishers. To accomplish this, the KB developed a dedicated infrastructure for processing and safekeeping of the e-publications. In 2002, the KB took the step to include international scientific e-journals in its deposit collection by signing the first formal archiving agreement with Elsevier Science. By doing so the KB became treasurer of an important part of the digital Records of Science. This responsibility implies an ongoing search for solutions for preservation and permanent access.

On August 20, 2002, at the Conference of the International Federation of Library Associations and Institutions (IFLA) in Glasgow, Elsevier Science and the Koninklijke Bibliotheek announced a groundbreaking electronic archiving agreement between publishers and libraries worldwide. The need to provide for permanent digital archiving has been evident to libraries [1] and to Elsevier for several years and Elsevier had been a leader in advocating publisher responsibility in this area. In 1999, Elsevier Science made a public commitment to ensure digital archiving with a trusted repository as part of its license with library customers.

Publisher’s View

The KB was a logical partner, well-known as a leader worldwide in experimentation and investment in digital preservation. Karen Hunter, Senior Vice President, Strategy at Elsevier and responsible for this digital archiving initiative, explains the relevance of this agreement:

“It is essential that we will be able to guarantee both authors and researchers using the journals that the electronic files will be permanently available. Journals have been called ‘the minutes of science.’ As we move toward journals being available only in electronic form and being held centrally on publishers’ computers, the public has the right to be assured that, should a publisher go out of business, these files will not be lost. This agreement provides that assurance for Elsevier Science titles, which constitute an essential part of the core scientific literature currently published.”

Librarian’s View

Research and development on long-term digital archiving has been a top priority at the KB. “Ensuring permanent availability of information and knowledge is at the heart of the KB's mission,” declared Wim van Drimmelen, Director General of the KB.

“Digital archiving is a logical extension of the role we always had and will have in the area of printed material, the modern version of a traditional task. In this era of electronic publishing new arrangements are needed globally in order to preserve our intellectual heritage. The KB wants to take an active part in these evolving new arrangements. It's an exciting challenge to find ways of coping with the fast pace of change in platforms and formats. From the start we committed ourselves strongly to this challenge. We take pride in this groundbreaking agreement with Elsevier and see it as a recognition of our achievements so far and a milestone on the way to our strategic goals.”

KB’s e-Depot

e-Depot is the name of the organization and infrastructure at the KB for archiving e-publications. The purpose of e-Depot is to ensure long-term availability of the digital files (the bits and bytes) and permanent access to the content (the information) captured in the files.

Within the organization of the KB three divisions are jointly responsible for running and developing the e-Depot: Acquisition & Processing Division, Information & Communication Technology, and Research & Development Division. The Acquisition & Processing Division is in charge of acquiring, processing, and archiving of e-publications. e-Depot is a special unit within this division and in charge of the day-to-day operations of obtaining, checking, and loading the e-publications, including their metadata. The Division for Information & Communication Technology is responsible for the technical maintenance of the infrastructure for the e-Depot. This task includes expanding the storage capability, guaranteeing backup, and providing media migration. This division also manages integration of the deposit system within the digital library infrastructure for cataloguing, search and retrieval, user registration, etc.

The Research & Development Division performs studies and experiments to develop and maintain the functionality of the e-Depot. These activities are usually joint projects with the two divisions mentioned before. External technology partners are often involved. The Research & Development Division also organizes or participates in international activities (e.g., development of standards, preservation studies and projects, conferences). For these activities a dedicated research unit named "Digital Preservation" has been created.


Figure 1. Organizational Structure of the Koninklijke Bibliotheek

To coordinate the activities and policy development concerning the e-Depot, the KB has implemented the e-Depot Steering Board, chaired by the Director of Information Technology and Facility Management. In addition to the three divisions already mentioned, the User Services Division also participates in the board. This division is in charge of providing access to the e-publications under conditions specified by the publishers. Because of the strategic impact of the e-Depot on the KB’s policy and organization, the Director General of the KB usually takes part in the board meetings.

e-Depot Infrastructure

The infrastructure of the e-Depot consists of both components that were specifically developed for processing, archiving, and maintaining e-publications, and typical digital library functions. According to the NEDLIB Guidelines, [2] the deposit system should be a separate, dedicated entity within the library’s digital infrastructure. For the traditional library processes, such as cataloguing, search and retrieval, and user registration and authentication, the KB uses the provisions already in place. So these functions have not been duplicated within the deposit system. This approach allows both the depository system and the traditional library systems to evolve at their own pace (e.g. in terms of new functionality and technical updating). Separate entities for e-archiving and for the traditional library also work to keep matters as simple as possible, both for the library and for the system providers.

The deposit system DIAS (Digital Information Archiving System) is the technical core of the e-Depot. The functions at the left of Figure 2 are for receiving and loading: EPO is the Electronic Post Office; BER is the Basic Error Recovery; NBN is the National Bibliographic Number generator. The functions at the right are for search, retrieval, and delivery: GGC is the Central Cataloguing System of Pica/OCLC; KB-TITEL is the local overall catalogue database at the KB; IAA is the function for Identification, Authentication, and Authorization of end users.

Figure 2. The Deposit System Within the Digital Library Environment

The Depository Task Extended

As the national library of the Netherlands, a key role of the KB is to serve as the depository library for publications produced by the country. In the early 1990s it became clear that, after about two decades of experimentation by publishers, e-journals were getting off the ground. Having determined in 1994 to include electronic publications in its deposit collection, the KB initiated discussions with Elsevier Science (ES) in 1995 about depositing e-copies of the ES journals with a Dutch imprint. By 1996 a preliminary agreement was signed and the first e-journals—a total of 315 in the end—were deposited at the KB. Finally, in 1999 the Dutch Publisher's Association and the KB made an arrangement implying that publishers would deposit all electronic publications with Dutch Imprint at the KB. The arrangement covers offline and online publications and prescribes restricted access conditions.

In August 2002, ES signed the archiving agreement with the KB to ensure permanent archiving of all their electronic publications, most of them e-journals. ES is prepared to establish formal archival agent relationships internationally with a limited number of libraries or other institutions, such as the KB.

The e-publications in the KB deposit can be used onsite by persons authorized and registered by the KB as pass holders. Usage is also allowed for print or fax copies of articles for interlibrary loan within the Netherlands. The KB may open access to the journals to users in general in the case that neither ES nor a successor offers these publications to customers. Also open access may be offered to certain journals or publication years upon notice from ES. Information about the e-publications may be included in the KB’s online public catalogue or in the National Bibliography.

Currently three international publishers have signed an archiving agreement with the KB: Elsevier, Kluwer Academic, and the open access publisher Biomed Central. Agreements with more scientific publishers are in preparation. The decision of the KB to establish a formal archival relationship with international publishers builds on the national depository task. By extending this task to include international e-publications (at the moment mostly in science, technology, and medicine), the KB intends to contribute to the development of a global solution for safeguarding these e-publications. A global solution is needed because, for international e-publications, the traditional approach of national deposit and national bibliographic control is no longer valid. To be sustainable, global depositing must eventually be based on new business models that take into account the permanent effort, and hence costs, of digital archiving. I have suggested earlier [3] that these costs should be an integral part of the costs of e-publishing. The experience at the KB has shown that once an e-deposit is in place, the costs to scale up the infrastructure and organization to include more publications are fairly modest.

e-Depot & Dutch Academic Repositories

In the Netherlands, Dutch universities, the KB, and three other academic institutions co-operate with the SURF Foundation (the foundation for the national science data network) in project DARE (Digital Academic Repositories). [4] The aim of DARE is to create an infrastructure of institutional repositories that will enable digital recording services, access, storage, and distribution of the Dutch academic output. The DARE infrastructure will closely interface with the e-Depot so that the published electronic academic output will be archived and preserved for the long term. Specific procedures and technological solutions will be developed, including provisions for return delivery, from the e-Depot to the repositories, of a copy of the original e-publication or a preserved and accessible copy.

Developing a Dedicated Deposit System

To handle the electronic publications, the KB needed a deposit system. In 1996 a first pilot system was developed in cooperation with AT&T. This pilot system was replaced in 1998 by a larger pilot system (up to 2 TB of storage) that was developed in cooperation with IBM. After several years of experiments and studies, a list of requirements for an operational deposit system could be compiled. A market scan had shown that a deposit system could not be bought off the shelf, so in 2000 the KB decided to tender for the development of one. Through a European tender procedure, IBM was selected as the best technology partner. The system was created on site at the KB premises. In October 2002, DIAS (Digital Information Archiving System) was handed over to the KB.


Figure 3. DIAS Configuration and OAIS

The functional design of DIAS is based on a standard for digital archiving, the Open Archival Information System Reference Model (OAIS-RM), ISO 14721:2003. The system is designed to be durable and provides for scalability, extensibility, and flexibility. It was built using off-the-shelf components as much as possible. [5] Figure 3 represents the functions of the system developed together with IBM for the e-Depot. The design complies with the OAIS model, the ISO standard for digital archives, which is shown in the background of IBM's functional design.

The key functions of DIAS are storage and long-term preservation. It allows the manual and automated ingest of digital publications. Once the publication is successfully stored, it is managed for preservation and permanent access. The preservation functionality is at the moment being developed further. For details about the configuration of DIAS delivered in 2002 to the KB, see the LTP report #1.

e-Depot Statistics

The Deposit System is capable of ingesting over 60,000 articles (mostly PDF) a day. The articles and their metadata are checked, processed for loading, and stored. The descriptive metadata are also copied to the KB catalogue database (see KB-TITEL in figure 2) for search and retrieval purposes.
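The check, load, and copy-to-catalogue sequence described above can be pictured with a minimal sketch. It is not DIAS code: the class names, the required metadata fields, and the in-memory dictionaries standing in for the archive and for KB-TITEL are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Article:
    identifier: str
    payload: bytes   # e.g. the bytes of a PDF file
    metadata: dict


@dataclass
class Deposit:
    """Toy stand-in for the archive store and the catalogue copy (hypothetical)."""
    stored: dict = field(default_factory=dict)
    catalogue: dict = field(default_factory=dict)
    rejected: list = field(default_factory=list)


# Assumed minimum set of descriptive fields; the real requirements are not given.
REQUIRED_FIELDS = ("title", "journal")


def check(article: Article) -> list:
    """Return a list of problems; an empty list means the article passes."""
    problems = []
    if not article.payload:
        problems.append("empty file")
    for name in REQUIRED_FIELDS:
        if not article.metadata.get(name):
            problems.append(f"missing metadata: {name}")
    return problems


def ingest(deposit: Deposit, article: Article) -> bool:
    """Check an article, store it, and copy its metadata to the catalogue."""
    problems = check(article)
    if problems:
        deposit.rejected.append((article.identifier, problems))
        return False
    deposit.stored[article.identifier] = article
    deposit.catalogue[article.identifier] = dict(article.metadata)
    return True


deposit = Deposit()
ok = ingest(deposit, Article("a1", b"%PDF-1.4", {"title": "T", "journal": "J"}))
bad = ingest(deposit, Article("a2", b"", {"title": "T"}))
```

Keeping the catalogue copy separate from the stored object mirrors the separation between the deposit system and the traditional library systems that the article describes.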

                     2003         2004 (growth)   2004 (prognosis)
e-journals           1.2 TB       1.8 TB          3.0 TB
CD-ROMs              0.7 TB       1.3 TB          2.0 TB
Total storage        1.9 TB       3.1 TB          5.0 TB
e-journal titles     1,200        1,400           2,600
e-journal articles   1,600,000    2,900,000       4,500,000

Table 1. Terabytes of Storage Used and Quantity of Content by Type in the e-Depot, 2003-2004
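A short check of the arithmetic in Table 1 may help in reading it. The sketch assumes that the 2003 column gives holdings at the end of 2003 and that the prognosis column is the expected total at the end of 2004; under that reading each prognosis is the 2003 figure plus the 2004 growth. It also relates the expected article growth to the stated ingest capacity of 60,000 articles a day.

```python
import math

# Rows of Table 1: (2003 figure, growth during 2004, prognosis for 2004).
# Reading the 2003 column as holdings at year end is an assumption.
TABLE_1 = {
    "e-journals (terabytes)": (1.2, 1.8, 3.0),
    "CD-ROMs (terabytes)": (0.7, 1.3, 2.0),
    "total storage (terabytes)": (1.9, 3.1, 5.0),
    "e-journal titles": (1_200, 1_400, 2_600),
    "e-journal articles": (1_600_000, 2_900_000, 4_500_000),
}


def row_is_consistent(held, growth, prognosis):
    """True when the 2003 figure plus the 2004 growth equals the prognosis."""
    return math.isclose(held + growth, prognosis, rel_tol=1e-9)


consistent = {label: row_is_consistent(*row) for label, row in TABLE_1.items()}

# Days of ingest at the stated capacity needed to load the 2004 article growth.
DAILY_INGEST = 60_000
article_growth = TABLE_1["e-journal articles"][1]
days_at_full_rate = article_growth / DAILY_INGEST

print(consistent)
print(round(days_at_full_rate, 1))
```

On these figures every row is consistent, and the expected article growth for 2004 corresponds to about 48 days of ingest at the stated daily capacity.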

Studying Long-Term Preservation

The contract for developing the deposit system included a joint research obligation, referred to as the ‘Long-Term Preservation Study.’ The research work was necessary because at that time the KB could not define specific enough requirements for preservation to demand development and delivery of the preservation functionality of the deposit system. It was agreed that IBM would take into account the results of the research effort when designing the depository system.

As a result of the Long-Term Preservation Study, a preliminary module for preservation could be realized. In addition, six reports were published in December 2002 summarizing the research results. The reports can be ordered in print from the KB or IBM, and are also available in PDF on the KB site. The titles of the reports illustrate the variety of preservation issues that have been covered:
1: The Long-Term Preservation Study of the DNEP Project—an Overview of the Results
2: Authenticity in a Digital Environment
3: Preservation Requirements in a Deposit System
4: The UVC: a Method for Preserving Digital Documents–Proof of Concept
5: Managing Media Migration in a Deposit System
6: Archiving Web Publications

After the deposit system was delivered and implemented in 2002, the KB continued at a modest scale [6] with the research on digital preservation. In 2003, KB and IBM worked on designing and developing further functionality for preservation management and for permanent access. The result is a further detailing of the ‘preservation planning’ function of the OAIS model into a Preservation Subsystem. In figure 4 the three components envisaged within the Preservation Subsystem are shown: the Preservation Manager, the Preservation Processor, and the Permanent Access Toolbox. Starting in 2003, a first version of the Preservation Manager has been developed. The Preservation Manager will be tested soon and, if appropriate, will be implemented within the e-Depot in 2004. [7]

Another result is a first permanent access tool, based on Raymond Lorie’s Universal Virtual Computer concept (see Long-Term Study Report 4). The tool enables one to view images in the future, regardless of any change in technical circumstances. [8] The development of more permanent access tools will need continuous dedicated research and development.


Figure 4. Preservation Subsystem in the OAIS Model

Promoting Digital Preservation in Practice

The challenge of preserving digital information and guaranteeing permanent access to it can only be addressed successfully by realizing a long-standing and close co-operation of three key-players: leading memory institutions (national libraries and archives), main producers of information (publishers and public agencies), and, last but not least, leading IT companies. The development of the e-Depot at the KB together with the science publisher Elsevier and IT-company IBM is a good example of such a co-operation. These three partners have been breaking new ground in the functional, technical, and policy area, in order to develop permanent availability of digital information. It is hoped that more major players in the areas mentioned will actually take up their responsibility for digital preservation and start pushing back frontiers.

Notes

[1] National Library of the Netherlands and Elsevier Science make digital preservation history. Permanent digital archive assures perpetual accessibility of scientific heritage. Press release, Glasgow, 20th August 2002, by Elsevier Science and the Koninklijke Bibliotheek.

[2] Johan Steenbakkers. The NEDLIB Guidelines. Setting up a Deposit System for Electronic Publications. NEDLIB Reports Series 5, November 2000, Koninklijke Bibliotheek.

[3] Johan F. Steenbakkers. Digital archiving: a necessary evil or a new opportunity. Serials Review, 30/1, pp. 29-32, 2004.

[4] For more information on DARE see www.surf.nl.

[5] About DIAS see www-5.ibm.com/nl/dias.

[6] In April 2003 a consortium of libraries, archives and IT companies unsuccessfully turned to the European Commission for financial support for an integrated preservation research project under the title PATCH (Permanent Access Toolbox for the digital Cultural Heritage).

[7] Raymond J. van Diessen, Erik Oltmans and Hilde van Wijngaarden. Preservation Functionality in a Digital Archive. To be published in the proceedings of the Joint Conference on Digital Libraries 2004, Tucson, Arizona, June 2004.

[8] Hilde van Wijngaarden and Erik Oltmans. Digital Preservation in Practice: The UVC for Images. To be published in Proceedings of the IS&T Archiving Conference, San Antonio, Texas. April 23rd, 2004.
Feature Article 2

Computational Linguistics Meets Metadata, or the Automatic Extraction of Key Words from Full Text Content

Authors: Marilyn Deegan, Harold Short - King’s College London (marilyn.deegan@kcl.ac.uk,harold.short@kcl.ac.uk), Dawn Archer, Paul Baker - Lancaster University (d.archer@lancaster.ac.uk,p.baker@lancaster.ac.uk), Tony McEnery, Paul Rayson - Lancaster University (eiaamme@exchange.lancs.ac.uk,paul@comp.lancs.ac.uk)

 

Introduction

For the past year, the Centre for Computing in the Humanities (CCH), at King’s College London and the Forced Migration Online team at the Refugee Studies Centre, University of Oxford have been working together to investigate the use of computational linguistics techniques for extraction of keywords from full-text content in a pilot project funded by The Andrew W Mellon Foundation.

The starting point for this project was the premise that it is easier to digitize large volumes of textual data than it is to create bibliographic records, and that it is particularly time consuming and expensive to add intellectual data such as keywords and abstracts. Our engagement with these issues grew out of several years' work on the development of and investigation into hybrid and digital libraries through the Malibu project (Managing the hybrid Library for the Benefit of Users) and the Forced Migration Online digital library, in both of which the Centre for Computing in the Humanities at King's College London and the Refugee Studies Centre at Oxford University were centrally involved. The Malibu project ran from 1998 to 2001; Forced Migration Online has been ongoing since 1997.

A great deal of progress has been made in automating the capture of full text from printed documents by the production scanning of print originals or surrogates followed by the application of advanced optical character recognition (OCR) algorithms. Once text is produced, sophisticated systems for full-text search using pattern matching or fuzzy matching offer excellent retrieval. However, the use of bibliographic descriptions, the addition of keywords to a document, and the application of topic trees and other taxonomic devices are still needed to improve precision and recall, and these meta activities generally still need a great deal of human time, effort, and skill. Some elements of bibliographic description are relatively easy to add to a documentary source, but the addition of keywords and other classificatory information is labor intensive and costly, and can also be highly subjective. Taxonomy is intellectually demanding, and although there are many well-formed classification schemes and subject thesauri, assigning the terms is still a manual process. An ancillary problem is that as subject areas grow and change, the classification schemes need continual updating, so a circular process exists where thesauri inform classification and new classifications in turn inform the further development of the thesauri.
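For readers less familiar with the two measures, precision is the share of retrieved documents that are relevant, and recall is the share of relevant documents that are retrieved. The sketch below computes both for two invented result sets; the document identifiers and the outcome are illustrative only and are not project data.

```python
def precision_recall(retrieved: set, relevant: set) -> tuple:
    """Return (precision, recall) for one query."""
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Invented example: four documents are truly relevant to the query.
relevant = {"d1", "d2", "d3", "d4"}

# A broad full-text match returns many documents, few of them relevant.
fulltext_hits = {"d1", "d2", "d5", "d6", "d7", "d8"}

# A search over assigned keywords returns fewer, better-targeted documents.
keyword_hits = {"d1", "d2", "d3"}

fulltext_scores = precision_recall(fulltext_hits, relevant)
keyword_scores = precision_recall(keyword_hits, relevant)
```

In this made-up case the keyword search scores higher on both measures, which is the kind of improvement that well-assigned keywords are meant to deliver.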

Some argue that adding value to content is unnecessary as a search engine will always find it, but a) that depends on knowing what you are searching for and b) can result in over-retrieval. As a recent commentator has remarked:

"Is it time to detach from our reliance on search engines? Consider the reality of relying on your favorite search engine. You're applying a pretty dumb technology (search algorithms) against a huge, undifferentiated pile of randomly selected, unorganized content; then adding billions of dollars of keyword-matched ads to the sorted output. Moreover, the effect over time of persistent ad placement in search results is to push those Web resources that lack the capacity or interest in placing ads further down the search results list and out of sight of most searchers."

What we were interested in is the use of intelligent algorithms that have been developed according to some statistical and/or linguistic principles to aid, not in the searching, but in the classification and keyword extraction processes, thereby gaining the benefits of automation with the precision of human-generated work. Were we successful? Read on …

The Subject Matter: Grey Literature about Forced Migration

Screenshot of Forced Migration Online Web Site

Forced Migration Online is a portal that provides access to a wide variety of online resources dealing with the situation of refugees and forced migrants worldwide. Designed for use by practitioners, policy makers, researchers, students, or anyone interested in the field, Forced Migration Online aims to give comprehensive information in an impartial environment and to promote increased awareness of human displacement issues to an international community of users. There is a great deal of content, with some 80,000 pages of full text content in the digital library and journals, as well as several thousand records in a Web catalogue and organizations directory. This content forms an excellent test set of diverse types of information based on one particular domain. What is particularly interesting here is that the content derives from many different kinds of agencies and individuals—the academic sphere, governmental organizations, non-governmental organizations, the press—and this has significant effects on the results of the various trials to extract keywords from them. Content on forced migration and refugee issues outside of Forced Migration Online has also been used in some of the trials described below. Up-to-date news from the UNHCR Web site as well as current newspaper content has also been analysed.

The full text content on Forced Migration Online outside of the journals is largely grey literature that presents particular problems for bibliographic description, classification, and assignment of subject terms and keywords. Organizations such as the Refugee Studies Centre have built collections of grey literature explicitly because it is rare, difficult to get hold of, and difficult to find in major library collections. Given the particular nature of this growing field, too, classification and thesaurus support are weak in the major classification schemas and thesauri, and it is generally not possible to obtain records from the major suppliers.

Thesaurus Development

A key input for the trials on keyword extraction described below is UNHCR's International Thesaurus of Refugee Terminology (ITRT), which was designed to facilitate information retrieval and exchange. In print since 1988, it has hitherto been available only in paper form. In 2003, in parallel with the Keyword Project, the UNHCR Library [1] and FMO began discussing how to create a Web-based version of the Thesaurus that a) would be more responsive to the needs of its users and b) could be used in the course of the keyword extraction trials.

Screenshot of the International Thesaurus of Refugee Terminology Web Site

It was decided to move very rapidly towards the development of the online version of the Thesaurus and so FMO and UNHCR commissioned Oxford ArchDigital, an Oxford University spin-out company, to develop the resource using their ToadHMS product, a customizable content management system.

The Thesaurus is now available as an interactive and searchable tool online, in English, French, and Spanish. Launched in December 2003, this new version is already serving as a more efficient medium for identifying relevant indexing terminology and as a value-added mechanism for managing refugee- and forced migration-related information. The Thesaurus was ready just in time for the trials undertaken by the Lancaster teams. These trials are discussed below.

Pilot Project Research Methods

With a focus on forced migration generally and the FMO collections in particular, we started out to investigate the following questions:

1. How might key terms be extracted from bodies of digital library materials in order to provide rich metadata that does not need to be human-generated?
2. How would these terms relate to the semantic environment that a thesaurus provides?
3. How could term extraction be ‘improved’ by use with thesauri, and thesauri ‘improved’ by term extraction?
4. What other work is being carried out that might inform developments in this area?
5. Who are the key players?
6. What commercial products are available?

The activities by which we carried out this study have been desk research; testing of data to prove concepts; extensive consulting in the community; and an expert workshop to discuss findings and the way forward.

Screenshot of the Keyword Project Web Site

Desk research and consultation yielded a great deal of information about this field. This is presented on the project Web site. The rest of this article reports on the remarkable results of the testing of data by the Department of Linguistics and Modern English Language and the Department of Computing at Lancaster University.

The One-Month Challenge

At the end of December 2003, and after some initial discussions, the Departments of Linguistics and Computing, Lancaster University, agreed to carry out two separate investigations on refugee material in time for a workshop to be held at the beginning of February 2004. [2] The short timescale led to its becoming known as the ‘one-month challenge!’ Given the emphasis on keyword analysis, members of the University Centre for Computer Corpus Research on Language (UCREL) carried out the challenge. UCREL is a cross-departmental research center that specializes in the automatic/computer-aided analysis of large bodies of naturally occurring language. In one experiment, Archer and Rayson semantically annotated material provided by the Forced Migration Online team, using the UCREL Semantic Annotation System (USAS), a software package for automatic dictionary-based content analysis. In the other, Baker and McEnery collected their own refugee data from the news section of the UNHCR Web site and from online newspapers, and performed keyword analyses on that data, using Wordsmith [3] and similar tools.

Each team agreed to work separately, and to keep their findings secret until they presented their respective results at the Keywords Workshop. The results were remarkable, as was the occasion, since neither team knew what the other team was going to say. What surprised everyone (the presenters included) was the close correspondence between the results of the two experiments.

The Archer and Rayson Trial

Archer and Rayson investigated in detail the benefits of semantically annotating refugee material using the UCREL Semantic Annotation System (henceforth USAS), and the feasibility of mapping the semantic domains of USAS to the classes used in the UNHCR Refugee Thesaurus. USAS is designed to undertake the automatic semantic analysis of present-day English texts (spoken and written), and this involves two stages:

(i) A part-of-speech tag is assigned to every lexical item or multi-word expression (MWE), using probabilistic Markov models of likely part-of-speech sequences (about 97% accuracy).

(ii) Output is fed into SEMTAG, which assigns semantic field tags on the basis of pattern matching between the text and two computer dictionaries developed for use with the program, and then applies a set of disambiguation techniques intended to select the correct semantic tag on each item given its context (about 92% accuracy).
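The two stages can be pictured with a toy pipeline. This is not USAS or SEMTAG code. The part-of-speech step uses a plain lookup table in place of the probabilistic Markov model, the semantic labels are invented rather than taken from the USAS tagset, and the disambiguation rule (prefer a candidate label already used by an unambiguous word in the same sentence) is only one simple possibility.

```python
# Hypothetical mini-lexicons; real systems use far larger resources.
POS_LEXICON = {
    "refugees": "NOUN",
    "fled": "VERB",
    "to": "PREP",
    "the": "DET",
    "camp": "NOUN",
}

# (word, part of speech) -> candidate semantic labels, most general first.
SEMANTIC_LEXICON = {
    ("refugees", "NOUN"): ["displacement"],
    ("fled", "VERB"): ["movement"],
    ("camp", "NOUN"): ["leisure", "displacement"],
}


def pos_tag(tokens):
    """Stage (i): attach a part-of-speech tag to every token (lookup only)."""
    return [(tok, POS_LEXICON.get(tok.lower(), "UNK")) for tok in tokens]


def sem_tag(tagged):
    """Stage (ii): look up candidate labels, then disambiguate by context."""
    candidates = [
        SEMANTIC_LEXICON.get((tok.lower(), pos), ["unmatched"])
        for tok, pos in tagged
    ]
    # Labels of unambiguous words act as the context for ambiguous ones.
    context = {cands[0] for cands in candidates if len(cands) == 1}
    result = []
    for (tok, pos), cands in zip(tagged, candidates):
        chosen = next((c for c in cands if c in context), cands[0])
        result.append((tok, pos, chosen))
    return result


annotated = sem_tag(pos_tag("Refugees fled to the camp".split()))
```

Here "camp" has two candidate labels, and the presence of "refugees" in the same sentence tips the choice towards the displacement sense rather than the leisure one.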

The present applications of the system include linguistic analysis, market research, content analysis, information extraction, and assistance for translation. USAS (via its Web interface called Wmatrix) is a quantitative content analysis tool that can automatically: (i) measure/compare the frequency of occurrence of different domains; (ii) provide statistical information regarding key concepts; (iii) provide a record of the vocabulary resources for those domains. It offers, therefore, a useful means of assessing the differing themes, concerns, attitudes (and mindsets/world views) of various texts/authors/institutions.

Results of Frequency Comparison for UNHCR Data

The trial on the Forced Migration Online data involved mapping top-level categories of the ITRT onto the USAS categories, and then analysing a number of documents provided by FMO. These documents were drawn from different domains and agencies, categorized as UNHCR, Federation of the Red Cross, Government agencies (general), NGOs (general) and Academic (mostly FMO grey literature). The total number of words in the document set was 432,317.

The results of this investigation were most promising. For example, preliminary findings of the mapping of the top-level categories of USAS and ITRT were extremely successful. Moreover, Archer and Rayson believe that, if more fine-grained mapping between the two systems were undertaken (i.e., mapping between sub-categories/classes), USAS might provide a means of linking the ITRT and the FMO, so that FMO users could search the FMO data using ITRT classes or domains. In terms of analysing data, USAS identified a number of terms that were not represented in the Thesaurus (even though they represent important topics in the documents), thus proving the value of automated techniques for "improving" thesauri: one of the stated intentions of the pilot project. The USAS-based analysis also identified attitudinal factors represented by the different agencies, as well as the topic categorizations of the documents, all of which suggest that FMO users can gain much from searching the FMO data using more general semantic categories (such as those utilised by USAS).

In summary, the USAS system is able to categorize data within Forced Migration Online with a considerable degree of accuracy, and to assign keywords to that data at least as well as human cataloguers. Moreover, large volumes of data can be processed rapidly and systematically.

The Baker and McEnery Trial

In their trial, Paul Baker and Tony McEnery made comparisons between news on refugees as reported by UNHCR on their Web site throughout 2003 and news on refugees as reported in a wide range of British newspapers during 2003. The analyses were carried out using the corpus analysis software package, WordSmith Tools. The results were again most impressive. Themes and ideas could be extracted readily from the texts, as with the Archer and Rayson trial, but what became apparent here was the difference in tone between UNHCR (an intergovernmental agency) and the press. Overall UNHCR used a neutral tone of reporting, while the press used highly emotive, persuasive, and manipulative terminology. Baker and McEnery found very different discourses in different representations of reality, and observed that data matters when constructing resources to reflect the world, for different discourses represent different worlds.

What was most illuminating for those present at the initial presentations of these two papers was a) the uncanny correspondence between the themes, ideas, tones, and keywords extracted by the different methods on different corpora representing the same subject domain, and b) the ability to extract from the texts matters that are of urgent and current concern to those dealing with forced migration. Those of us who were familiar with the field were struck forcibly by the accuracy with which the corpus linguists who had hitherto had little exposure to this area could present the issues. It was a graphic demonstration of what we had hoped for in the project, which is that computational linguistics techniques could be used to extract accurate keywords from digital library content in a meaningful way.

Conclusion

What has been particularly interesting and productive about the project is that it has involved the collaboration of individuals and research groups from a number of different domains who have rather different methodological perspectives, and who do not normally engage closely with each other: computational linguists, information specialists, digital library specialists, humanities computing specialists, system designers, and specialists in forced migration. This led to some terminological misunderstandings, and there was much discussion during the course of the work about differences in terminology between different domains and approaches. For instance, the word ‘thesaurus’ can mean either a list of controlled terms in a library cataloguing environment or a list of synonyms and antonyms (e.g., Roget’s Thesaurus). A full list of project participants is available online. [4]

The success of the experiments has surprised all those who have taken part in them. The various propositions we started out with were confirmed, and the tagging and analysis systems worked with remarkably little trouble. The teams are planning a number of follow-up projects, including mapping the USAS system to the ITRT and using this as a browse tool for Forced Migration Online, and testing the systems on different domains of grey literature.

Notes

[1] For details see the UNHCR Library.
[2] Information is available online for the workshop program, participants list, report and all workshop presentations.
[3] See Scott, M. (1999) WordSmith Tools Help Manual. Version 3.0. Mike Scott and Oxford University Press for details about WordSmith.
[4] A major publication is being produced by the project. Current versions of the essays are available online.


Feature Article 3

Character Sets and Character Encoding: A Brief Introduction

Author: Ardie Bausenbach - Library of Congress (abau@loc.gov)

Introduction

The generation and processing of descriptive metadata in this time of transitioning data structures and formats creates new challenges for libraries and other cultural institutions. Whether we use MARC, MARC XML, MODS, Dublin Core, VRA Core, ONIX, OAI-PMH, EAD, TEI, FGDC, or other metadata schemas, our standard practices and toolsets are evolving to meet the demands of today’s digital environment. For many librarians and archivists, one of the more complex and confusing areas of this development is the handling of character sets and character encoding information. New technical terminology permeates the documentation. The impact of character encoding changes in our data environment may be unclear. Yet, as our metadata becomes globally accessible through the Internet and other Web services, we must develop a basic understanding of character encoding concepts and their impact on metadata interoperability.

Character Encoding Basics

Every character we see on our computer screens, whether it’s a letter, a number, a mark of punctuation, or a symbol, is stored as a sequence of binary digits (0s and 1s). Creating, storing, displaying, searching, sorting, and transmitting textual information in all the world’s languages, however, requires thousands of characters. To meet this need, software applications have relied on hundreds of coded character sets built on multiple national standards. Successfully exchanging textual data in this environment has been problematic. The solution: Unicode and its sister standard ISO/IEC 10646, standards that have developed a universal character set for all the world’s scripts. [1]

English-language software for personal computers has traditionally been based on 7- and 8-bit (or 1-byte) coded character sets, which can represent a maximum of 256 characters. The most common of these character sets, ASCII, contains 128 character codes for the basic Latin alphanumeric characters. While fine for English, ASCII does not support the accented letters used in Western European languages, nor does it define characters in non-Latin scripts.

To meet the needs of different user communities that want to write in their own languages, personal computer software application vendors, such as Microsoft and Apple, have supplemented ASCII with additional characters. Various national standards bodies have also approved coded character sets that extend ASCII to 256 single-byte characters, and beyond. The end result: hundreds of customized, often language-specific character sets, each of which encodes different characters in code positions beyond the 128 found in ASCII. More than 26 character sets have been created for the Latin script alone, including: ISO 8859-1, Windows-1252, and ASCII/ANSEL. Single- and multibyte character sets have also been produced for non-Latin scripts, including Code Page 1251 for Cyrillic, Code Page 932 for Japanese (Shift-JIS), and the Windows Glyph List 4.0.

No one coded character set beyond ASCII has dominated the marketplace; rather, regional preferences prevail. A variety of keyboard layouts, fonts, and input method editors (or IMEs) are required to support each language’s character sets. Character codes obviously conflict between character sets. The same character code can be mapped to different characters, and the same character mapped to different character codes. Even then, not all characters and scripts can be recorded by computers. Whenever text files encoded with different character sets are exchanged between applications or platforms, the data must be converted before it is useable. When conversions fail, data is garbled or corrupted.
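The conflicts described above can be seen with a few lines of Python. This is only an illustrative sketch: the byte value 0xE9 and the three code pages are arbitrary choices, not examples taken from any particular library system.

```python
# One byte value, different characters depending on the coded character set.
raw = bytes([0xE9])
as_latin1 = raw.decode("latin-1")   # ISO 8859-1: Latin small letter e with acute
as_cp1251 = raw.decode("cp1251")    # Code Page 1251: Cyrillic small letter short i

# One character, different byte values depending on the coded character set.
e_acute = "\u00e9"
in_latin1 = e_acute.encode("latin-1")  # one byte, 0xE9
in_cp437 = e_acute.encode("cp437")     # one byte, 0x82 (original IBM PC code page)

# A failed conversion: UTF-8 bytes misread as ISO 8859-1 produce garbled text.
garbled = e_acute.encode("utf-8").decode("latin-1")

print(as_latin1, as_cp1251, in_latin1, in_cp437, garbled)
```

The last line shows the familiar two-character debris that appears when the receiving application assumes the wrong character set.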

Unicode, a Master List of Characters

To resolve this Tower of Babel, the International Organization for Standardization (ISO), the International Electrotechnical Commission (IEC), and the industry-backed Unicode Consortium synchronized efforts to develop a master list of characters in the world’s writing systems. The result was a single comprehensive universal character set, first documented in the 1991 publication of “The Unicode Standard.” Although the ISO/IEC standard (ISO 10646) and the Consortium standard (Unicode) remain separate, their character repertoires are effectively identical. Computer systems are now able to create, store, display, process, and transmit textual data independent of hardware platforms, software applications, or language.

The first release of Unicode provided 65,536 code positions in what is now called the Basic Multilingual Plane. Unicode 4.0 (2003) supports 55 writing systems and 96,382 assigned characters (more than 70,000 of which are Chinese ideographs). To allow for expansion, some code ranges have been reserved for local character definitions (“Private Use Areas”) or for future extensions. In addition, 16 additional planes (each with 65,536 code positions) have been defined. A complete listing of the Unicode characters and their character codes can be found in the Unicode code charts.
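As a rough illustration of how the planes partition the code space, the plane of a character can be computed from its character code. The helper name plane_of and the sample characters are assumptions made for this sketch.

```python
PLANE_SIZE = 0x10000  # 65,536 code positions per plane

def plane_of(ch: str) -> int:
    """Return the plane number (0 = Basic Multilingual Plane) of one character."""
    return ord(ch) // PLANE_SIZE

# The BMP plus 16 additional planes gives the full Unicode code space.
total_code_positions = 17 * PLANE_SIZE

print(plane_of("A"), plane_of("\U0001D11E"), plane_of("\U00020000"))
print(total_code_positions)
```

Here "A" falls in plane 0, the musical symbol U+1D11E in plane 1, and the ideograph U+20000 in plane 2.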

How is a character defined in Unicode? It is defined as the smallest unit of a written language that has semantic value. Unicode has been designed as a system for encoding characters, not for determining how they are displayed. The typographical variations that result from fonts and font attributes (such as bold or italics) do not create new characters.

Every Unicode character has a unique character code, derived from its position in the Unicode code charts. Where practical, each character also has a unique descriptive name (some Chinese ideographs are not named). Character names can be translated, but character codes remain constant, permitting unambiguous machine processing of textual data. ASCII characters have been assigned the first 128 Unicode character codes, which enables the ASCII characters to be considered valid in Unicode.
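The pairing of a stable character code with a descriptive name can be inspected with Python's standard unicodedata module. The describe helper below is a name invented for this sketch.

```python
import unicodedata

def describe(ch: str) -> str:
    """Format a character as its U+ code followed by its Unicode name."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

# ASCII characters keep the same numeric values in Unicode.
ascii_preserved = all(ord(bytes([i]).decode("ascii")) == i for i in range(128))

print(describe("A"))
print(describe("\u00e9"))
print(ascii_preserved)
```

The second call reports the code and name used as the running example later in this article, "Latin small letter e with acute."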

Unicode characters are recorded in computer memory in the order in which they are keyed. This means that, regardless of whether the text is read right-to-left, left-to-right, or in columns, Unicode-compliant software should properly handle all scripts. Some scripts, such as Arabic and Hebrew, are bidirectional; characters (including punctuation) can be ordered from left-to-right and right-to-left in the same line. To help rendering systems produce correct visual presentations, Unicode has defined a complex bi-di algorithm to assure that Unicode-compliant software applications interpret character sequences the same way.
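A small sketch, using Python's unicodedata module, of the directional property that the bidirectional algorithm consults for each stored character. The sample string is an arbitrary assumption; the code only reads the property and does not perform any reordering for display.

```python
import unicodedata

# Stored (logical) order: three Latin letters, a space, then Hebrew alef, bet, gimel.
text = "abc \u05d0\u05d1\u05d2"

# Bidirectional class of each character, in the order the characters are stored.
bidi_classes = [unicodedata.bidirectional(ch) for ch in text]

# Arabic letters carry their own right-to-left class.
arabic_alef_class = unicodedata.bidirectional("\u0627")

print(bidi_classes, arabic_alef_class)
```

"L" marks left-to-right characters, "R" right-to-left characters, "AL" Arabic letters, and "WS" whitespace; a rendering system uses these classes to decide the visual order.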

Unicode Character Encoding

Computers exchange and process data using bytes or sequences of bytes. But the number of bytes used in character codes may vary. Many pre-Unicode coded character sets (such as ASCII, ISO-8859-1, Windows-1252) can be expressed in 1 byte (8 bits). Unicode, with its thousands of character codes, requires at least 2 bytes (16 bits). So how are computer systems transitioning between the older 8-bit coded character sets and Unicode? Systems rely on mapping information known as character encoding forms, each of which uses a particular algorithm or character encoding scheme to convert 16- and 32-bit Unicode code values to sequences of one or more bytes.

Unicode has defined a small number of character encoding forms to make it easier for software applications to use or store Unicode character codes. The most commonly used forms are UTF-8 and UTF-16. UTF-8 transforms every Unicode character into a sequence of one to four 8-bit code values (earlier definitions of UTF-8 allowed sequences of up to six). ASCII characters remain encoded with single-byte code values. All other characters are encoded with two or more bytes. Because UTF-8 expresses code values in 1-byte increments, pre-Unicode software applications that expect single-byte ASCII code values will continue to work with UTF-8 data. And if our text contains mostly ASCII characters, using UTF-8 encoding saves computer storage space, because characters are stored with 8-bit rather than 16- or 32-bit code values. UTF-16, on the other hand, uses a single 16-bit (or two-byte) code value for every character in the Basic Multilingual Plane, and a pair of 16-bit code values (a surrogate pair) for characters in the additional planes. As a result, all characters from the Basic Multilingual Plane can be represented with a single code value, the same value found in the Unicode code charts.
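The difference between the two encoding forms is easy to observe by encoding a few sample characters and counting bytes. This is a minimal sketch; the sample characters and the function name are arbitrary choices. The last sample lies outside the Basic Multilingual Plane, where UTF-16 needs two 16-bit code values instead of one.

```python
def byte_lengths(ch: str) -> tuple:
    """Return (bytes in UTF-8, bytes in UTF-16) for a single character."""
    return (len(ch.encode("utf-8")), len(ch.encode("utf-16-be")))

samples = {
    "A": byte_lengths("A"),                    # ASCII letter
    "e-acute": byte_lengths("\u00e9"),         # Latin small letter e with acute
    "ideograph": byte_lengths("\u4e2d"),       # a common Chinese ideograph
    "g-clef": byte_lengths("\U0001D11E"),      # a character beyond the BMP
}

# For a BMP character, the UTF-16 code value equals the code chart value.
utf16_of_e_acute = "\u00e9".encode("utf-16-be")

print(samples, utf16_of_e_acute)
```

For mostly-ASCII text the first column stays at one byte per character, which is the storage saving described above.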

Unicode Hardware and Software Support

Today, Unicode is the de facto character encoding standard for major computer hardware and software applications, a process that has required significant retooling efforts to integrate the longer 16- and 32-bit code values into system designs. Unicode support is now incorporated into many metadata structures and formats (including XML, HTML, MARC, EAD2002, OAI-PMH). Recent versions of standard operating systems (such as Windows, Mac OS, Solaris, AIX, Linux), Internet browsers (such as IE, Netscape, Mozilla, Opera), software applications (including Microsoft Office, X-Metal, Adobe type technology, Oracle, MySQL, Sybase, DB2), and programming languages (such as Java, Perl, C/C++, and Visual Basic) are gradually adding more and more Unicode support. Even text editors like Windows Notepad now support Unicode. Existing font sets are being updated and new ones created to enable Unicode-compliant data entry in various scripts, a process that entails mapping existing glyphs to Unicode character codes, adding glyphs for new character codes, floating diacritics where appropriate, and correctly interpreting bidirectionality in scripts.

But inconsistencies abound, presenting challenges for users. Because the Unicode standard is evolving, systems are compliant with different versions of Unicode—some with Unicode 3.0, others with version 3.2, others with the newest version 4.0. Some vendors prefer UTF-8; others prefer UTF-16. We must be careful that we use appropriate toolsets. Software routines built into our applications—that change our textual data “on the fly” from UTF-8 to UTF-16, and back again—work if our data is in Unicode. Data encoded using ASCII or ISO 8859-1, however, may not be automatically converted. We may need to rely on special application routines or we may need to preprocess our data using third-party transformation filters to convert our character encodings (from, for example, ISO 8859-1 or MARC-8 to UTF-8). And data transformation tools found in earlier versions of word processing software from vendors like Microsoft and Corel often leave little or no evidence that we used proprietary glyphs for non-ASCII characters when we “save as” a Unicode-compliant format such as XML.

Much work remains to be done. It may be difficult to tell which coded character set and character encoding form has been used with our data. And it is not always easy to distinguish between displays of precomposed and decomposed characters with diacritics for diagnostic purposes. Public domain fonts that comprehensively cover the Unicode character set are not yet available. Displays for bidirectional scripts are often problematic. However, we must remember that this is a transitional period for metadata generation and processing. Unicode is not yet a fully supported standard with well-documented toolkits. Much of our metadata has been created using character encodings other than Unicode, and we may face extensive data conversion projects. The hardware and software environment, however, is rapidly making significant strides in managing Unicode data.

Entering Unicode Characters in Our Applications

While typing characters on a computer keyboard may seem simple, in reality we are storing numeric values for each character. Wherever possible, we should use the tools provided by our document-production software to key non-ASCII Unicode characters into our text. Drop-down menus and toolbars integrated into our applications allow us to select Unicode characters from graphic displays, ensuring that the character encoding in our documents is consistent. Our computers may also have general tools, such as the Windows “Character Map” (found under the Accessories > System Tools menu), to help us enter character codes correctly. We must be careful to avoid introducing inconsistent character encoding forms when we cut and paste text from various sources into our documents.

If we need to enter Unicode characters modified by diacritics, we can key these characters separately: as a base alphabetic character followed by one or more combining characters (for example, “Latin small letter e” or “U+0065,” then “combining acute accent” or “U+0301”). Many common accented characters established in pre-Unicode character sets can also be entered using a single character code (for example, “Latin small letter e with acute” or “U+00E9”). These single codes, known as precomposed characters, have been incorporated into Unicode; however, no new precomposed character codes will be added to the standard. Four normalization forms have been defined in Unicode to help software applications identify character codes that should be treated as “equivalent” for searching, indexing, and sorting, as well as to help determine what glyphs should be displayed for each character code.
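The equivalence between the decomposed and precomposed spellings of the example above can be checked with Python's unicodedata.normalize. This is a minimal sketch of two of the normalization forms (NFC composes, NFD decomposes); it is not a description of how any particular cataloguing system normalizes its data.

```python
import unicodedata

decomposed = "e\u0301"    # U+0065 followed by U+0301 combining acute accent
precomposed = "\u00e9"    # U+00E9 Latin small letter e with acute

# The two spellings look identical on screen but differ as stored code values.
differ_as_stored = decomposed != precomposed

# Normalizing both to the same form makes them comparable for searching and sorting.
composed = unicodedata.normalize("NFC", decomposed)
expanded = unicodedata.normalize("NFD", precomposed)

print(differ_as_stored, composed == precomposed, expanded == decomposed)
```

Applying one form consistently before indexing avoids missed matches between the two spellings.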

There are also circumstances when Unicode characters must be keyed using an ASCII representation of their character codes. We may need to key a reserved character (such as an ampersand) in our XML text, or we may need to use a character not available on our keyboards or present in our applications’ drop-down menus and toolbars. The usual manner for entering these ASCII code values is through a special notation practice known as an escape sequence, which adds a special prefix and suffix to the Unicode character code. In HTML and XML, these Unicode character codes should be entered using numeric character references, prefixed by an ampersand and a pound sign, and ending with a semicolon. For example: a “Latin small letter e with acute” can be keyed using the decimal numeric character reference or the hexadecimal numeric character reference to represent the Unicode character code “U+00E9.”

Decimal numeric character reference: &#233;
Hexadecimal numeric character reference: &#xE9;
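Numeric character references can be derived mechanically from a character code. The two helper functions below are names invented for this sketch; html.unescape from the Python standard library is used only to confirm that both notations resolve to the same character.

```python
import html

def ncr_decimal(ch: str) -> str:
    """Build a decimal numeric character reference for one character."""
    return f"&#{ord(ch)};"

def ncr_hex(ch: str) -> str:
    """Build a hexadecimal numeric character reference for one character."""
    return f"&#x{ord(ch):X};"

e_acute = "\u00e9"  # U+00E9
decimal_ref = ncr_decimal(e_acute)
hex_ref = ncr_hex(e_acute)

# Both references decode back to the same character.
round_trip_ok = html.unescape(decimal_ref) == html.unescape(hex_ref) == e_acute

print(decimal_ref, hex_ref, round_trip_ok)
```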

Unicode and XML

Many data formats used to describe library and archival content rely on XML schemas or DTDs to structure their metadata: MODS, MARC XML, Dublin Core, EAD2002 eadheader, VRA Core, TEI header, ONIX, CIMI/Spectrum. The following issues are important to keep in mind when we use XML to structure our data:

Encoding Declaration. The default encoding for XML is Unicode UTF-8 or UTF-16. If our parsable text uses a coded character set other than Unicode, our document’s XML declaration must include an encoding declaration that identifies this character set. In fact, including an encoding declaration is good form in general:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>

XML 1.0 specifies that all XML parsers must be able to handle UTF-8 and UTF-16. However, when encoding information is not included in the XML declaration, parsers must use heuristics to decide whether UTF-8, UTF-16, ISO-8859-1, or another encoding has been used. The parser may generate a “fatal error” if the encoded characters in our document use a different character set or character encoding form than specified in our encoding declaration. Therefore, it is important that any textual data we cut and paste into our XML documents either apply the same encoding or be appropriately transformed when we save our documents. Some XML editors allow us to specify the document encoding when we save it. But if this is not an option, we can also alter a document’s encoding through an XSLT transformation, by using the xsl:output element (and its encoding attribute) in an XSLT stylesheet.
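The effect of a mismatched encoding declaration can be reproduced with the XML parser in Python's standard library. This is a sketch under stated assumptions: the title element and its content are made up for illustration, and other parsers may report the error differently.

```python
import xml.etree.ElementTree as ET

def parses(data: bytes) -> bool:
    """Return True if the byte string is accepted by the XML parser."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False

document = '<?xml version="1.0" encoding="ISO-8859-1"?><title>Fran\u00e7ais</title>'

# Declaration and actual bytes agree: both are ISO-8859-1.
matching = document.encode("iso-8859-1")

# Declaration claims UTF-8, but the bytes are still ISO-8859-1.
mismatched = document.replace("ISO-8859-1", "UTF-8").encode("iso-8859-1")

print(parses(matching), parses(mismatched))
print(ET.fromstring(matching).text)
```

The second document fails because the single byte used for the c with cedilla in ISO-8859-1 is not a valid UTF-8 sequence.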

Reserved Characters for XML Markup. XML reserves five characters to separate XML markup from textual data: & (ampersand), > (greater than), < (less than), ' (apostrophe), and " (double quotation marks). If we use these reserved characters in parsed character data (or PCDATA), they must be escaped using either numeric character references or entity references.

For example, ampersands may be encoded as follows:
&#38; and &amp;

Reserved characters used in CDATA must be enclosed in markup and, if necessary, converted to entity references (numeric character references cannot be used in CDATA). To encode the phrase “2 & 4 are < 9,” for example, we would enter:

PCDATA: 2 &amp; 4 are &lt; 9
CDATA: <![CDATA[2 & 4 are < 9]]>
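The two ways of protecting reserved characters can be compared with Python's standard library. This is only a sketch: the wrapping p element is an assumption made for illustration, and xml.sax.saxutils.escape is one of several available escaping helpers.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

phrase = "2 & 4 are < 9"

# PCDATA: reserved characters are replaced by entity references.
as_pcdata = "<p>" + escape(phrase) + "</p>"

# CDATA: the text is left untouched and enclosed in CDATA section markup.
as_cdata = "<p><![CDATA[" + phrase + "]]></p>"

# A parser recovers the same text from both forms.
same_text = ET.fromstring(as_pcdata).text == ET.fromstring(as_cdata).text == phrase

print(as_pcdata)
print(as_cdata)
print(same_text)
```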


A limited number of control codes are considered valid XML characters, including the tab, carriage return, and line feed. When importing data containing these codes into XML documents, it is best to convert the control codes to XML markup.

Representing Non-ASCII Characters in XML. Whenever our XML data entry software supports entering non-ASCII characters from graphic menus or toolbars, we should do so. The W3C recommends that precomposed characters, where possible, represent characters modified by diacritics. We should use numeric character references for non-ASCII characters only when necessary. The W3C strongly discourages use of the character entity references found in HTML and SGML documents (such as “&eacute;”). While valid in XML, these entity references should be replaced using the graphic character displays or numeric character references, as appropriate. Character entity references are not permitted, for example, in OAI records and are not supported by all XML tools. As in SGML, character entity references, when used, must be defined.

Non-ASCII characters found in URIs should be converted to UTF-8, following the URI syntax defined by the IETF in RFC 2396. Each non-ASCII character should be represented in UTF-8 in an escaped form (%HH, where HH is the hexadecimal value) for each byte of the UTF-8-encoded character code. In the following example, the URL contains the word "Français", with the character “Latin small letter c with cedilla” (in Unicode: U+00E7; in UTF-8: x‘C3A7’). This URL should be encoded as:

http://directory.google.com/Top/World/Fran%C3%A7ais/
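The escaping shown above can be reproduced with urllib.parse.quote from the Python standard library, which percent-encodes the UTF-8 bytes of each non-ASCII character by default. This is a minimal sketch that escapes only the path portion of the example URL.

```python
from urllib.parse import quote, unquote

host = "http://directory.google.com"
path = "/Top/World/Fran\u00e7ais/"

# quote() leaves "/" alone and escapes each byte of the UTF-8 form of U+00E7.
escaped_path = quote(path)
escaped_url = host + escaped_path

# unquote() reverses the escaping, again assuming UTF-8.
restored_path = unquote(escaped_path)

print(escaped_url)
print(restored_path == path)
```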

Any Unicode character valid for XML names can be used in XML element and attribute names (non-ASCII characters must, however, be represented using UTF-8 or UTF-16 encoding, not numeric character references). When non-ASCII characters are used in element and attribute names, we should expect that some XML transformation tools may not handle these characters gracefully. Numeric character references are also not allowed in processing instructions and XML comments. A small number of character set issues, including improved normalization of characters across Unicode versions and the ability to use recently defined Unicode characters in XML markup, will be resolved in XML Version 1.1.

MARC 21 Records

MARC 21 metadata records are intended for broad, standardized exchange, using the ISO 2709 interchange format. To ensure interoperability, MARC 21 records must be encoded using either MARC-8 or UTF-8. If leader position 9 in a MARC 21 record contains a blank value (x‘20’), the record uses MARC-8; if the value is “a,” the record uses Unicode encoded in UTF-8.
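The leader test described above is simple enough to sketch directly. The function name and the sample leader values below are made up for illustration; the sketch assumes the 24-character MARC 21 leader with positions counted from zero.

```python
def marc_character_encoding(leader: str) -> str:
    """Report the character encoding signalled by leader position 9."""
    if len(leader) != 24:
        raise ValueError("a MARC 21 leader is 24 characters long")
    code = leader[9]
    if code == " ":
        return "MARC-8"
    if code == "a":
        return "UTF-8"
    raise ValueError(f"unexpected value in leader position 9: {code!r}")

# Hypothetical leaders that differ only in position 9.
utf8_leader = "00714cam a2200205 a 4500"
marc8_leader = utf8_leader[:9] + " " + utf8_leader[10:]

print(marc_character_encoding(utf8_leader))
print(marc_character_encoding(marc8_leader))
```

A conversion tool would typically make this check first, so that MARC-8 records are transformed before being treated as Unicode.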

MARC 21 Repertoire. At this time, the MARC 21 repertoire supports ASCII/ANSEL as the default MARC-8 character set. Basic Hebrew, basic and extended Cyrillic, basic and extended Arabic, Greek, EACC (East Asian Coded Character Set, ANSI/NISO Z39.64), and Canadian Aboriginal Syllabic scripts are also supported. With the approval of UTF-8 as an encoding option, MARC 21 records encoded with UTF-8 can now include Unicode characters not available in MARC-8. To enable round trip transformations between UTF-8 and MARC-8 with as little loss as possible, only a restricted subset of Unicode characters is currently included in the MARC 21 repertoire. This decision will be revisited periodically as the character encoding environment develops. All legacy data created prior to the approval of UTF-8 as an accepted MARC 21 encoding can be converted between MARC-8 and UTF-8 without data loss.

MARC 21 Repertoire, UTF-8, and Diacritics. The MARC 21 repertoire defines a set of diacritics valid for use in MARC 21 records. Precomposed characters are not used in this repertoire; all diacritics are entered separately from the alphabetic character they modify. If our MARC 21 record input software supports both MARC-8 and UTF-8, we should be cautious when entering data into a MARC 21 record from sources that use precomposed Unicode characters. For a small number of characters in non-Western languages, the standard Unicode character decomposition specifications do not map to valid MARC 21 characters. If we rely on standard Unicode decomposition software, our records may contain invalid characters.

The placement of diacritics differs between MARC-8 and UTF-8. Diacritics precede the character they modify in MARC-8 encoding. MARC 21 records encoded using UTF-8, however, follow the Unicode practice of placing diacritics after the character they modify. When transformation tools convert records between MARC-8 and UTF-8, some records may sort incorrectly because non-filing indicators in their title fields were set incorrectly.

MARC 21 Repertoire, UTF-8 and EACC Characters. With the adoption of UTF-8 encoding for MARC 21 records, the mapping of East Asian character codes in MARC 21 records was reexamined. The EACC character set used for Chinese, Japanese, and Korean records contained 258 characters that were not part of Unicode 3.2. To ensure that all characters used in MARC 21 records could be shared successfully across institutions, the MARC community decided not to represent these 258 EACC characters with Unicode character codes reserved for local use (“Private Use Area”), but instead to map the characters to similar Unicode character codes. These mappings are specified in Alternative Unicode Mappings for MARC 21 Characters Assigned to the Private Use Area (PUA).

MARC 21 and MARC XML. Traditional MARC 21 records (in the ISO 2709 format) can be transformed into XML using schemas such as the one maintained by the Library of Congress, the MARC XML schema. The Library of Congress also offers a downloadable MARC 21 to MARC XML transformation in its MARC toolkit (ZIP file) which uses the MARC-8 repertoire of characters in the conversion to Unicode, thus retaining the MARC 21 convention of separating diacritics from the characters they modify. Other transformations may use precomposed characters.

Conclusion

A single comprehensive universal character set—Unicode—in combination with the flexible framework for structuring and packaging metadata provided by XML gives libraries and other cultural institutions significant new tools for creating and processing metadata. Over the next few years, the limitations we currently face—in areas such as font support, input and rendering programs, diagnostic programs, and data conversion utilities—will be addressed by steadily improving hardware and software applications with a broad market base. Recognition of this trend should facilitate our adoption of Unicode, a standard that will ultimately ensure essential metadata interoperability, provide greater platform and vendor independence, reduce development costs, and improve metadata permanence.

Notes

[1] Unicode® Consortium is a registered trademark and Unicode™ is a trademark of Unicode, Inc.


ECURE Summary

ECURE Presents Diverse Views of Preservation and Access

Authors: Rob Spindler - Arizona State University (robert.spindler@asu.edu), Jeremy Rowe - Arizona State University (jeremy.rowe@asu.edu)

 

In March, Arizona State University hosted ECURE 2004, the 5th Preservation and Access for Electronic College and University Records (ECURE) conference. ECURE brought together archivists, technology administrators, records managers, registrars, faculty, librarians and university administrators to discuss the issues and potential solutions to retention and dissemination of the data, publications and other resources generated by university researchers. Interdisciplinary discussions revealed the wide variety of applications and information policies that impact present and future access to university research.

David Sobel, General Counsel for the Electronic Privacy Information Center (EPIC), kicked off the event with an informative presentation on the status of the USA Patriot Act and its relationship to other government efforts to acquire personal information for anti-terrorist investigations. He noted that despite the efforts of EPIC and other concerned organizations, “We really don’t know more about the operation of the Patriot Act than we did a year and a half ago.” Sobel underscored the differences between information requests authorized by Section 215 of the Patriot Act, National Security Letters, and projects like the Transportation Security Administration’s CAPPS II program for screening airline passengers. He also noted places where the Patriot Act has nullified privacy protections of FERPA, the difficult balancing act that this forces upon universities when requests are received, and a specific example involving government efforts to acquire information about anti-war demonstrators at Drake University. “What we’re seeing is a trend of continuing to expand both the government’s authority to seek information and also to expand the categories of information that can be obtained.”

The following morning Clifford Lynch, Executive Director of the Coalition for Networked Information, offered his fifth ECURE keynote address over breakfast. Dr. Lynch focused on the increasing interdependence of teaching, research, and scholarly communication. Using learning management systems as an example, Lynch said “It’s getting very hard to tell what’s a record, what is research, and what is teaching and learning.” He also noted that international, inter-institutional and student/faculty collaborations are making ownership of intellectual property a much more complex issue than before, and that most universities are not addressing these “hard” issues.

Lynch continued by citing the recently approved policy of the National Institutes of Health that requires recipients of large grants to include a plan for preserving and disseminating research data. He noted that some scholarly publishers now require access to the raw data behind the research results as part of the agreement to publish, in order to facilitate independent analysis and verification. Lynch also recognized that disciplinary repositories and scholarly associations often fund data retention through grants and even single bequests, but that coordination across projects and long-term commitments to preserving research data and results are evolving only slowly. Lynch noted, “Government funding is a dangerous thing to count on across long periods of time. How are we going to bring disciplinary content back into universities when we have funding failures elsewhere?”

Nineteen other speakers from universities and businesses across the United States and Canada reviewed a variety of research data management and dissemination issues including institutional repositories, metadata creation and management, digital signatures, federated databases, student privacy, Website preservation, and records management for learning management systems. Many of the sessions featured examples of direct collaborations between information professionals and research faculty. Presenters and attendees hailed from prestigious research universities such as Harvard, Yale, and Stanford.

Presentation slides, photographs and streamed video of the keynote presentations by Sobel and Lynch will be available at the ECURE 2004 site shortly (www.asu.edu/ecure). ECURE 2005 will be held in March at Arizona State University. The focus will once again be preservation of and access to the research data, products, and resources created by university research, and the Call for Papers is currently available on the ECURE Web site. For additional information contact the co-chairs: Robert Spindler (robert.spindler@asu.edu) or Jeremy Rowe (jeremy.rowe@asu.edu).


 Highlighted Web Site

LEADERS: Linking EAD to Electronically Retrievable Sources



The School of Library, Archive and Information Studies, University College London (UCL) has recently released the LEADERS Demonstration Application. The LEADERS project set out to enhance the structure and scope of information available to remote users of digitized archival material. Specifically, the project team aimed to leverage XML standards and tools to bring together archival documents, finding aids, and authority records within a Web interface. This was accomplished by integrating three XML-based encoding standards: Encoded Archival Description (EAD), Encoded Archival Context (EAC) and the Text Encoding Initiative (TEI). EAD is an established standard for encoding archive finding aids. EAC is a newer standard for creating authority records containing information about the biographical or administrative histories of the creators of archive materials. TEI standardizes the encoding of electronic texts and enables search, sort and presentation features. Working together in a platform-independent Web environment, these three encoding systems present more powerful and flexible means for accessing and analyzing digital archive content. The Demonstration Application provides a working example of these integrated systems in action using a subset of UCL’s George Orwell Archive and University College London Archive.

The Website also provides a rich set of resources to support the project and anyone wishing to use a LEADERS system with an archival project, including:

  • instructions for installing and configuring a LEADERS server system
  • technical specifications, including guidelines for EAD encoding, DTD schemas, and Tag Libraries
  • an end-user manual
  • data gathered from user surveys and focus groups
  • a gentle introduction to the role of XML tools in the archival process (FAQ)
  • an extensive reference list for related topics such as user needs, SGML/XML, and standards
  • links to project reports, papers and presentations.

 Calendar of Events

ERPANET Seminar on File Formats for Preservation
May 10–11, 2004
Austrian National Library, Wien
Co-sponsored by Austrian National Library (Österreichische Nationalbibliothek, ÖNB) and Digital Curation Centre (UK), the seminar will focus on file format and file format obsolescence issues.

Digitization for Cultural and Heritage Professionals Workshop
May 16–21, 2004
Chapel Hill, North Carolina
This weeklong course is the fourth offering of the Digitization for Cultural and Heritage Professionals workshops. The course will employ lectures, seminars, and practical exercises to help cultural heritage institutions gain a better understanding of issues to consider when developing digitization projects.

13th World Wide Web Conference
May 17–22, 2004
New York City, New York
Look for workshops focused on the Semantic Web, High Performance XML Processing, Content Labeling, and others. The International Conference on Autonomic Computing will also be held in conjunction with this conference.

NISO Workshop: Metadata Practices on the Cutting Edge
May 20, 2004
Washington, DC
This is a one-day meeting addressing theoretical and practical aspects of topics such as metadata syndication, digital archiving, metadata quality assurance, the Joint Working Party initiative on serials and subscription metadata, METS, and MODS.

Libraries in the Digital Age (LIDA) 2004
May 25–29, 2004
Dubrovnik and Mljet, Croatia
The two themes of this year’s LIDA annual conference are Human Information Behaviour, and Competences for Digital Libraries. The first theme builds around issues of how people seek out and use information in the context of digital libraries. The second theme deals with identifying skill levels and skill development, not only for digital library professionals and staff, but also for consumers of digital content.

Joint Conference on Digital Libraries (JCDL) 2004
June 7–11, 2004
Tucson, Arizona
The theme of JCDL 2004 is Global Reach and Diverse Impact, with highlights of the geographical, cultural, and technological reaches of digital libraries.

Electronic Media Group Meeting
June 9–14, 2004
Portland, Oregon
As a part of the American Institute for Conservation's 32nd annual meeting, the Electronic Media Group program activities will attempt to answer the central question: “Is progress being made in addressing the many preservation challenges posed by technology-driven cultural materials?” The program will highlight case studies, project results, panel discussions and other methods applied for the purpose of preserving electronic media including electronic art, multimedia, audiovisual materials, and computer and video games.

ACH/ALLC Pre-conference XML/XSLT Workshops
June 9–11, 2004
Göteborg, Sweden
Maryland Institute for Technology in the Humanities will offer two XML/XSLT pre-conference workshops as part of the 2004 Joint International Conference of the Association for Literary and Linguistic Computing and the Association for Computers and the Humanities (ACH/ALLC). Participants may elect to register for Introduction to XML and the TEI or Introduction to XSLT.

The Digital Library and E-Publishing for Science, Technology, and Medicine
June 13–18, 2004
Geneva, Switzerland
This course, designed for Science, Technology, and Medicine librarians, will address issues and trends in electronic publishing, electronic journals, and digital libraries.

The Joint Technical Symposium (JTS)
June 24–26, 2004
Toronto, Canada
JTS, “the international gathering for all specialists of the audio-visual, cinema and sound heritage,” will be holding its sixth annual conference.

Digital Preservation Management: Short-Term Solutions to Long-Term Problems
July 19–23, 2004
Ithaca, NY
Registration will open in May for Cornell University Library’s summer offering of its digital preservation management workshop.


 Announcements

Yahoo! Search partners with digital library resources
As part of its new Content Acquisition Program, Yahoo! has teamed up with several digital library resources including: The New York Public Library, Project Gutenberg, University of Michigan's OAIster project, UCLA's Cuneiform Digital Library Initiative (CDLI), and the National Science Digital Library (NSDL), the National Science Foundation's online library. The Content Acquisition Program enables both non-commercial and commercial content providers to directly provide their Web pages, which are then added to Yahoo!'s search index in order to help improve search quality and expand the scope of content users can access.

The Open Archival Information System Reference Model: Introductory Guide
This guide, the first of a series of commissioned reports from the Digital Preservation Coalition (DPC) Technology Watch Service, is now available. The aim of the UK-based DPC is to "foster action....to secure our global digital memory and knowledge base."

The Library of Congress Cataloging Directorate Webcast available
"Improving User Access to Library Catalog and Portal Information"
Presented by Dr. Marcia J. Bates, December 12, 2003. This presentation highlights recommendations on how to achieve enhanced access to and display of records for selected Web resources across multiple systems.

Searching the Internet for Images & Controlling Your Language: Links to Metadata Vocabularies
Two new public resources from the Technical Advisory Service for Images (TASI) in the UK.

Web-enabled PRONOM 3 is now available
The UK National Archives’ database of file formats and their supporting software is now available online. PRONOM 3 users can search by various fields such as file extension, product name, and vendor. Search results can be printed or saved as an XML or CSV file.

OCLC Research announces new ResearchWorks Web Site
This site showcases several current OCLC research projects. The site contains links to project pages, demos/prototypes, and project-related message forums. “Things to play with and think about.”

NSF Post Digital Library Futures Workshop Final Report
This report, based on an NSF workshop held in June 2003 in Chatham, MA, highlights the critical need for continued research and development in digital library technology and expertise and describes priorities for the next phase of digital library research.

eScholarship Repository adds peer-reviewed publications
As part of its effort to address the high cost of traditional modes of scholarly communication, the University of California Digital Library has expanded its eScholarship Repository to include peer-reviewed journals from a wide range of disciplines.


 RLG News

New Look and Functionality for RLG DigiNews; Upcoming RLG Forum



New Look and Functionality for RLG DigiNews
Two years ago to the day, the editors of RLG DigiNews opened the April 15, 2002 issue with an article entitled RLG DigiNews: Taking Stock at Five Years. The article highlighted the evolution of this publication as it entered its sixth year. Two years later, entering its eighth year of publication, RLG DigiNews is once again changing not only its look and feel, but also its functionality.

As most readers will notice, the entire RLG web site has changed, moving from static HTML pages to database-driven content maintained through a content management system. Many institutions and organizations have gone through this kind of change. As RLG undertook this conversion, a key consideration was to preserve the integrity of readers' finely honed bookmark collections. Rather than creating a digital preservation problem, RLG adopted a managed strategy to maintain access to past issues.

New issues of RLG DigiNews will be published only within the new RLG web site, but older, legacy issues will be accessible through both systems. So what does this mean for you? For the foreseeable future, it means that RLG DigiNews issues will be available whether you reach them via a published citation, through an announcement of availability, or via the new web site. The transition should be nearly invisible to users: in addition to preserving access and links to issues, the original functionality and the "look and feel" of issues will remain constant.

Within this new environment, whole issues as well as single components of issues can be easily printed by following the links within each section of the issue. For the time being, indices for new issues will be separate from those of legacy issues, but this will change in the next few months. In the interim, pages provide links and ready access to the older issues that are used and cited often.

For future access, it would be best to bookmark the new access point for RLG DigiNews: http://www.rlg.org/en/page.php?Page_ID=12081. If you have any questions about this new RLG DigiNews site, please contact Robin Dale.

Upcoming RLG Forum in Europe - To Have and to Hold: Metadata and Institutional Repositories, 18 May 2004
Following on the well-received To Have and To Hold: Metadata and Institutional Repositories forums held at the Library of Congress and the Chicago Historical Society in December 2003, and at Stanford University in April 2004, RLG has scheduled the final event in this series.

This one-day forum covers two interrelated topics pertinent to members and non-members alike: metadata and institutional digital repositories. Featuring RLG member experts and other speakers local to the venue, the forum serves as an educational opportunity for those desiring to learn more about how peer institutions are addressing the challenges related to long-term access to and preservation of digital materials.

On 18 May 2004, this forum will travel to Europe and be held in Den Haag, The Netherlands. Hosted by the Nationaal Archief, the final forum in this series will feature expert speakers from our member institutions based in Europe. Speakers and a full agenda can be found here. For more information, please contact Fran Devlin.


 Publishing Information

RLG DigiNews (ISSN 1093-5371) is a Web-based newsletter conceived by the RLG preservation community and developed to serve a broad readership around the world. It is produced by staff in the Department of Research, Cornell University Library, in consultation with RLG and is published six times a year at www.rlg.org.

Materials in RLG DigiNews are subject to copyright and other proprietary rights. Permission is hereby given to use material found here for research purposes or private study. When citing RLG DigiNews, include the article title and author referenced plus "RLG DigiNews." Any uses other than for research or private study require written permission from RLG and/or the author of the article. To receive this, and prior to using RLG DigiNews contents in any presentations or materials you share with others, please contact Jennifer Hartzell (jlh@notes.rlg.org), RLG Corporate Communications.

Please send comments and questions about this or other issues to the RLG DigiNews editors.

Co-Editors: Anne R. Kenney and Nancy Y. McGovern; Associate Editor: Robin Dale (RLG); Technical Researcher: Richard Entlich; Contributor: Ellie Buckley; Copy Editor: Martha Crowe; Production Coordinator: Carla DeMello.

All links in this issue were confirmed accurate as of April 12, 2004.




 
© 2006 RLG