RLG DigiNews
  April 15, 2002, Volume 6, Number 2
ISSN 1093-5371       

 

 

RLG DigiNews: Taking Stock at Five Years
Cornell Editorial Staff

preservation@cornell.edu

Introduction
Five years ago, RLG published the first issue of RLG DigiNews. A lot has changed since then, and a good bit has remained the same. We're using this anniversary issue as a case study to reflect on those changes. This feature article discusses key turning points for RLG DigiNews from the access and preservation perspectives. Our FAQ asks "where are they now" as it follows up on two projects that were announced in the first issue. In the June 2002 issue, we'll report on several more. The fate of these projects, like the other changes that the editorial staff of RLG DigiNews has witnessed, reveals both the opportunities and the obstacles that line the shores of a swiftly moving technological sea.

RLG DigiNews had its roots in an RLG electronic group-based document, "Diginotes," compiled by members of PRESERV as a way to keep pace with the rapidly developing field of digitization. In the two "issues" distributed via email to a special RLG discussion list, "Diginotes" contained announcements on, and citations to, "library imaging technology and applications." Though "Diginotes" ceased after two compilations, the need for timely information on the topic of digitization did not.

Responding to member requests for assistance and information, a new Web-based document (RLG DigiNews) was born in early 1997. RLG joined forces with the staff of the Cornell University Library Department of Preservation and Conservation to provide a "substantive, informative, and timely response to the expressed desire of preservation specialists for an easy-to-understand, broadly conceived information stream on selected worldwide efforts in the converging fields of preservation and digitization." The editors promised to capitalize on the "enhanced functionality of a Web-based publication, by providing hot links to featured documents and enhanced searching capability." The publication has matured considerably in the past five years, as reflected in the changing masthead (figure 1) and the list of publishing milestones presented below.

[Images: RLG DigiNews mastheads from 1997, 1998, 1999, 2000-2001, and 2002]


Figure 1: The masthead has evolved to reflect the changing look of the Web


Publishing Milestones

  • RLG DigiNews began as a quarterly newsletter, but became a bimonthly publication in its second year. The Council on Library and Information Resources provided initial support for the two extra issues each year.
  • One FAQ and at least one Highlighted Web Site (HWS) have appeared in every issue. In recognition of the increasing popularity of FAQs, we moved this feature beginning with the August 1999 issue from its less obvious location in the midst of news and announcements to a more prominent place after the HWS.
  • Technical Reviews by editorial staff appeared only in Volume 1. These were replaced by periodic Technical Features, for the most part written by external authors.
  • The first Conference Report appeared in the December 1999 issue, reflecting the increasing importance and regularity of key meetings on digital imaging and preservation.
  • Beginning in 2000, each issue included coverage of some aspect of digital preservation. Relevant articles and items were flagged by a new icon that incorporated the infinity symbol typically associated with preservation (it is used, for example, to denote permanent/durable paper).
  • The Editor's Interview, introduced in August 2000, has provided an opportunity for focused discussions with key people on current hot topics.
  • For the first issue, searching was limited to use of a Web browser's "Find" or "Search" function, but by the second issue, viewers could browse the tables of contents or use keyword searching. Author and title indexes were subsequently added.


Access and Use

The initial intended audience for RLG DigiNews included "managers of digital initiatives with a preservation component or rationale." Since then, the reader base has grown dramatically, with the number of hits more than tripling from 1997 to 2000, from just over 20,000 hits to over 70,000. RLG reports that this publication is one of its most popular electronic resources. Lars Aronsson's Telecom History Timeline mentions the founding of RLG DigiNews as a historic event in 1997. The publication meets the Americans with Disabilities Act requirements for accessibility.

Each issue attracts thousands of readers on five continents. Back issues have a long shelf life. Usage of many early issues has not diminished substantially over time, and in some cases has increased. The October 15, 1998 issue, for instance, had more hits in 2001 than it did in 1999. Some issues remain significantly more attractive to users than others. The two most popular issues of 1999 featured lead articles on digital imaging and preservation microfilm (February) and digitization costs (October), proof that both topics continue to spark interest.

A Google index search identified over 1,000 links from various Web sites to RLG DigiNews. The publication shows up frequently on resource pages of consultants and faculty for digital preservation, digital imaging, and library conservation. Library portals around the world, including those in Australia, Canada, China, Europe, Israel, New Zealand, South Africa, South America and the United Kingdom, link to the publication. Twenty-one features have been highlighted in Current Cites, which each month selects "only the best items to annotate" from the current literature in information technology in print and digital form. Preservation "safekeeping" arrangements for many RLG DigiNews articles have been made with Australia's PADI, a subject gateway to select digital preservation resources (see below).

 

map showing world readership of RLG DigiNews

Figure 2: A map showing the world readership of RLG DigiNews. The majority of readers come from North America (74%) and Europe (19.5%).

Chart showing number of visitors coming from various organizations

Figure 3: This chart shows the number and percentage of visitors coming from various Web domains, such as .COM, .NET, .ORG and .EDU, for a three-month period.

graph tracking first time visitors

Figure 4: Tracking first-time visitors for three months reveals a peak in readership at the point of publication.

Preservation

While the access statistics for RLG DigiNews are gratifying, we were also interested in determining the health of the journal itself, especially since so many of the back issues continue to receive high use. The five-year anniversary offered a convenient milestone for reviewing our plans for long-term access to the content. Cornell staff prepared a list of key preservation considerations as a basis for self-examination and for identifying potential risk factors for the publication. Using these as a guide, the editorial staff at Cornell and RLG sought to assess the publication's preservation readiness.

1. Organizational commitment

  • What is RLG's commitment to maintain and continue the publication?
  • What is the funding stream; for how long is it secure?
  • Does RLG have a preservation strategy/plan in place?
Robin Dale, Program Officer, Member Programs and Initiatives, Research Libraries Group: RLG intends to maintain and continue the publication for as long as it continues to be a valuable resource to the community. The funding stream is a line item within the yearly budget and is continued from year to year. It is anticipated that this stream will remain in place for as long as RLG continues the publication. Regarding a preservation strategy/plan, we have several procedures in place. I make regular backup copies of the material to at least two different media, each of which is stored in a different physical location. The copies are regularly refreshed. This is in addition to the technical infrastructure and back-up described in question 2 below and the third-party archival arrangements described in question 9. Finally, RLG DigiNews issues, along with other selected RLG publications, are part of a testbed digital archive project currently underway at RLG. With all of these strategies in place, we feel comfortable with the security of RLG DigiNews.

2. Technical infrastructure

  • Where do the bits reside?
  • What kind of hardware/software/server is used?
  • Is it backed up, 24/7 supported?
Robin Dale: RLG DigiNews is an integrated part of the RLG corporate web site content. With a 1.544 Mbps connection to our Genuity internet service provider, RLG's corporate web server, named Lyra, sits on a 1.5 Mbps LAN. Presently, the server is a SUN UltraSparc 2, running the Solaris 2.5.1 (Unix) operating system. This is a single-CPU machine rated at 400 MHz, with 250 MB of internal memory and 20 GB of external disk storage. It runs 24 X 7 X 365, and is continuously monitored for availability. The system and data are fully backed up weekly, using Veritas NetBackup, involving a rotating four-cycle process and off-site tape storage. The server platform will soon be upgraded in connection with an overall corporate website make-over project, and will be run on a SUN 220R server, which is a dual-CPU machine rated at 440 MHz each, with 2 GB of internal memory and 40 GB of external disk storage. The new Lyra corporate web server will be running the Sun Solaris 2.8 (Unix) operating system. Trained RLG and Stanford University technical staffs are on call at all hours to ensure fast response to any special system needs. Vendor support from SUN is designated as "Silver", meaning we have contracted for "within four hours" on-site response for 8 hours daily on weekdays, with 24-hour support at all times by telephone.

3. Data Fixity

  • What means are in place to secure the files and protect them from unauthorized change and use, data corruption, etc?
Robin Dale: The Lyra server is a bastion host, meaning it must be accessible to the general Web-using public. Because of this, it is also locked down from an information security standpoint. Access to other servers is tightly controlled, as is the capability to use the Lyra server as a "Trojan horse" to access other sites. All Lyra server changes are made by system administrators who are permitted differing degrees of capability under user identification and password protection. The system also runs Perl 5.6 and a Web Indexer. Other software in support of the RLG corporate web site on the Lyra server includes Digital Certificate Issuing software for credit card transactions, a Virtual Web Server for access to detailed customer accounting reports under customer id and password control, an anonymous FTP server and a POP Mail server. The system uses the Apache web server software. Upon installation, the Lyra server was configured with the RLG-customized "security hardening" kit, which includes software such as YASP, tripwire-like features, and monitoring software to ensure the security of the server environment. The server platform and all network connections enjoy the physical security of being located in Forsythe Hall on the Stanford University campus, which has very secure physical access and operational integrity characteristics.
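The "tripwire-like features" mentioned above generally work by recording a digest of each file and later comparing fresh digests against that baseline. The following is a minimal sketch of that idea in Python, not a description of RLG's actual configuration; the function names and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def build_manifest(root):
    """Map each file under root (by relative path) to its SHA-256 digest."""
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def compare_manifests(baseline, current):
    """Report files changed, removed, or added since the baseline was taken."""
    changed = sorted(p for p in baseline if p in current and baseline[p] != current[p])
    removed = sorted(p for p in baseline if p not in current)
    added = sorted(p for p in current if p not in baseline)
    return {"changed": changed, "removed": removed, "added": added}
```

A baseline manifest would be stored somewhere the Web server cannot write to, and a scheduled job would rebuild the manifest and flag any non-empty "changed" or "removed" list for review.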

4. Format stability, reliability, and complexity

  • What formats are used? What versions?
  • Do they adhere to common, open standards?
  • Is the coding correct and the data validated (e.g., HTML validator, parser)?
  • Does the journal rely on experimental approaches (e.g., technology that may be very short-lived)?
  • Is format control exercised by the editorial staff (e.g., does the staff do the mark-up or do contributors, does the journal establish and maintain format requirements)?
  • How complex are the formats in terms of variations, computation, volume?
Cornell editorial staff: RLG DigiNews currently uses GIF and JPEG for images and the current version of HTML for text markup. Current RLG DigiNews manuscript submission policy expands the acceptable article submission formats from ASCII to include Microsoft Word and RTF (Rich Text Format). Current versions are used. We adhere to open common standards. Mark-up by the editorial staff is consistent and follows established standards. The RLG staff adheres to established procedures for validating the content before posting each issue. The journal tracks trends in Web site design and management, but uses technology that is readily available to avoid inhibiting use or maintenance. The staff does the mark-up using format requirements that adhere to RLG requirements and good practice. The formats used are not very complex. The content is intended for easy use in international settings.

5. Authenticity and Provenance

  • What is the policy on correcting mistakes? Is the original version maintained or the changes noted?
The Editorial Policy for RLG DigiNews is as follows: "Upon discovery and notification to RLG, the error is corrected and a note is inserted into the text to explain the reason why the text was corrected, as well as the date of the correction."

6. Redundancy

  • Is the publication mirrored? If so, where and in how many places?
  • Is there a formal agreement for mirroring in the works?
Robin Dale: The publication is not currently mirrored though we are considering some possible arrangements. Discussions with specific institutions are in the preliminary phase, though implementation of any agreement probably won't take place until at least the end of 2002.


7. Metadata

  • Technical (are the technical approaches well documented, e.g. use of javascript, the guts of the technical application, dependency on external programs and scripts, documentation on changes):
Cornell editorial staff: Yes, the source code includes scripts and these are well-documented. We document changes in policy and practice.
  • Navigation (nature and extent of descriptive and structural metadata, e.g., SGML, Table of contents, consistency of approach, etc.):
RLG DigiNews uses Dublin Core metadata elements and keywords for descriptive metadata. Each issue contains an embedded table of contents to help users navigate through the content. The mark-up is consistent from issue to issue, and changes in the structure and presentation of the content are noted at the time they are implemented.
  • Resource discovery: How can people find the journal?
  • What search engines and abstracting services pick them up?
  • What practices promote/inhibit resource discovery (e.g., use of metatags)?
  • Does the journal provide indexing/searching features itself?
Each issue of the journal is announced on major professional electronic mailing lists. RLG DigiNews is actively promoted by RLG and features prominently on the RLG Web site. As noted earlier RLG DigiNews is well-represented on institutional, organizational, and personal web sites devoted to digital imaging and preservation information. RLG DigiNews uses META tags for Dublin Core data elements, for keywords, and for high-level content elements. Consistent and correct formatting also promotes resource discovery. The RLG DigiNews site provides Author and Title indices, as well as links to back issues and basic full-text search capability. RLG also permits crawling of its site by major search engines, including Google, to facilitate resource discovery by users.
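To make the role of META tags more concrete, here is a small sketch of how an indexing tool might read Dublin Core elements from a page header. The sample HTML is hypothetical and is not copied from an actual RLG DigiNews issue; it only follows the common convention of prefixing Dublin Core element names with "DC." in the name attribute. The class and function names are likewise illustrative.

```python
from html.parser import HTMLParser

# Hypothetical page header, for illustration only.
SAMPLE_HEAD = """
<head>
<title>Example issue</title>
<meta name="DC.Title" content="Example Newsletter, Volume 1, Number 1">
<meta name="DC.Creator" content="Example Editorial Staff">
<meta name="DC.Date" content="2002-04-15">
<meta name="keywords" content="digital preservation, digital imaging">
</head>
"""


class MetaTagCollector(HTMLParser):
    """Collect name/content pairs from META tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            self.meta[name] = content


def extract_meta(html_text):
    """Return a dict of all named META tags found in html_text."""
    collector = MetaTagCollector()
    collector.feed(html_text)
    return collector.meta


def dublin_core_only(meta):
    """Keep only the entries whose name carries the DC. prefix."""
    return {k: v for k, v in meta.items() if k.lower().startswith("dc.")}
```

Calling dublin_core_only(extract_meta(SAMPLE_HEAD)) returns the three DC.* entries and leaves out the plain keywords tag, which a search engine would typically treat separately.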

8. External dependencies

  • Does the publication use consistent/persistent link naming?
  • What's the nature and extent of dead links?
  • Are links really dead or just moved?
  • What policy does the publication follow, if any, when including external links, e.g., linking priorities, extent of monitoring and updating?
Cornell editorial staff: RLG DigiNews does not use any of the persistent link approaches. Each issue is a single document from which individual articles can be printed, so each article does not have a unique identifier. The naming of issues is consistent. RLG DigiNews incorporates many links into every issue, including the Highlighted Web Site (HWS). All of the back issues of RLG DigiNews are available on the RLG site and the staff uses link analyzers to monitor the site, though past links are not corrected if the sites are moved or removed. For instance, of the 34 HWS in the first five volumes of RLG DigiNews, 27 are still active and seven have moved. Of those seven, two have merged into one site, one has linked to a new site through a redirect, and one link has become corrupted.

The sidebar by Richard Entlich discusses the issue of link integrity.

9. External, third party archives

  • What external archives cover the journal and how complete is the coverage; to what extent could the journal be recreated from these archives?

Robin Dale: The Internet Archive can capture pages and sites that might not be saved otherwise and may be a piece of a retention program but should not be viewed as a substitute for a digital preservation program for RLG DigiNews. The Internet Archive holds copies of all issues of RLG DigiNews with the exception of the last, Volume 6, Number 1, 15 February 2002. This is because the most recent crawl of the site was on 7 February 2002. Since crawls of the site tend to take place every 8 to 9 weeks or sometimes longer, I'd imagine that this "missing issue" will be covered soon. Other than that, the journal could be recreated.

Cornell Staff Note: The Internet Archive recently launched the "Wayback Machine," an online tool to search the archive's vast holdings. When we used the Wayback Machine to search for RLG DigiNews we discovered some interesting results, as indicated by the table below. Note that the first copy listed for the April 1997 issue was from May of that year, while the first copy obtained for the RLG DigiNews home page was from December 1998, a year-and-a-half after the journal was first published.

Page                      First Capture
RLG DigiNews home page    Dec 2, 1998
April 1997 issue          May 3, 1997
April 1998 issue          Aug 15, 2000
April 1999 issue          Dec 9, 2000
April 2000 issue          Aug 17, 2000
April 2001 issue          Apr 18, 2001


Table 1. The Wayback Machine capture dates for RLG DigiNews issues
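The gap between publication and first capture can be worked out directly from Table 1. The sketch below does so under one loudly stated assumption: that each April issue appeared on the 15th of the month. That day is not given in the table and is used here only for illustration, as are the function names.

```python
from datetime import date

# First-capture dates for the April issues, taken from Table 1.
FIRST_CAPTURE = {
    1997: date(1997, 5, 3),
    1998: date(2000, 8, 15),
    1999: date(2000, 12, 9),
    2000: date(2000, 8, 17),
    2001: date(2001, 4, 18),
}


def assumed_publication_date(year):
    """Assumption: every April issue was published on April 15."""
    return date(year, 4, 15)


def capture_lag_days(year):
    """Days between the assumed publication date and the first capture."""
    return (FIRST_CAPTURE[year] - assumed_publication_date(year)).days


def lag_report():
    """Return a {year: lag in days} mapping for every issue in the table."""
    return {year: capture_lag_days(year) for year in sorted(FIRST_CAPTURE)}
```

Under that assumption, the April 1997 and April 2001 issues were captured within weeks or days of publication, while the April 1998 issue waited more than two years for its first capture.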


Besides the Internet Archive, the National Library of Australia's Preserving Access to Digital Information (PADI) initiative has established the Safekept program. Being selected for the Safekept program provides an opportunity for organizations to establish, review and/or enhance their Web site preservation programs. The Safekept program identifies a nine-step program for ensuring the preservation of the selected Web sites, to which contributing organizations must adhere. The Safekept program has many, but not all, issues of RLG DigiNews marked within its databases. This "minimalist" approach was born out of an agreement with PADI that RLG would provide for/maintain all of RLG DigiNews in many ways and formats and therefore it wasn't absolutely necessary for NLA to do the same.

10. Look and Feel

  • Is the old design and any functionality of early issues maintained (e.g., interface changes)?
Cornell editorial staff: Yes. The back issues include the masthead that was in use when the issue was published. Users can view changes in the journal's presentation and format. Earlier we noted the various changes and introductions to the journal.

11. Virtual content

  • How much of the content is virtual (e.g., created at the point of access, or generated on-the-fly)? How well can this virtual content be maintained?
Cornell editorial staff: The content of RLG DigiNews is captured in static HTML documents. Recent issues contain a script that monitors use of the site, but there is no virtual content to be maintained.

12. Ability to retain extended (added-value) services of the journal

  • Is preserving its function as a ready reference database or other information services supported?
Robin Dale: The functionality of RLG DigiNews is easy to maintain. The search capability is upgraded periodically to continue to provide basic searching using current technology. Features that provide access to the content are based primarily upon links that support navigation between issues and identifying topics and authors of interest.

What do our readers think?
In our February 2002 issue, we included a readers' survey to help guide our planning for the next five years of RLG DigiNews. Responses came from 233 readers, including 66 written comments. Overall, we received an abundance of valuable information, for which we thank everyone who offered their time and thoughtful feedback.

"RLG DigiNews is the single most valuable source of information that we have. Its current focus is right on target." (Nancy Kushigian)


We were pleased to discover that 87 percent of respondents claimed that the content of RLG DigiNews is "just about right" in its usual level of technical detail. We found it interesting that two-thirds of respondents favor digital imaging features over other parts of the journal. However, respondents were split on the type of content they prefer; 39 percent favor policy recommendations, while 28 percent prefer information on technical standards and best practices, and 27 percent favor equipment reviews. Preferences may be changing, as we found that responses from new readers (those who have discovered the journal within the last two years) were slightly more numerous than responses from long-time readers.

We were also interested to learn that 60 percent of respondents discovered RLG DigiNews through listservs, while only 23 percent learned about it from colleagues and just 10 percent through other publications and Web sites. This is no doubt due to posting announcements of the release of new issues on 20 listservs worldwide.

In sum, as we begin our sixth year of publication, the state of RLG DigiNews appears healthy. Although we can't forecast technological advances and readers' interests for 2008, we can expect at least as many changes as the past five years have witnessed. The editorial staff will maintain flexibility, will periodically take the pulse of our readers, and will look forward to writing the 10th anniversary article!

Link Analysis in RLG DigiNews
Richard Entlich

Once an issue of RLG DigiNews goes "to press" its content is considered fixed. Subsequent changes are only made to correct significant errors, and those are always documented within the issue. Although such a policy guarantees the editorial integrity of the publication, it also means that, over time, the links to external Web sites will gradually degrade.

How bad a problem is "link rot?" Pretty bad, as any regular Web surfer will quickly tell you. Not only does Web site content change and move around a lot, but domain names lapse and are reassigned on a regular basis.


We conducted an analysis of links to external Web sites from the first five years of RLG DigiNews issues. The results are shown below. When all the issues for each year of publication are averaged together, there is a nearly linear increase in the annual percentage of bad links, from about 10% in our most recent year of publication (2001) up to about 40% in our first year (1997).


A graph showing external link status from 1997 to 2001

There were a total of 1236 unique external links in the 28 issues of RLG DigiNews published prior to this issue. How did we determine the validity of so many links? For the most part, we did what most people would do when faced with such a large task: we took advantage of automation and employed a software link checker. Writing a link checker must either be a very popular assignment in computer science classes or else part of a rite of passage into the world of open source computing, because there are dozens of link checkers available. Many are freeware or shareware or free online services. A good list of available products and services (with a few of its own bad links) is available at http://www.elsop.com/wrc/comp_ls.htm.


As one might expect with such an abundance of products on the market, there are some substantial differences in features and performance. We tested only a few products, but found some considerably more flexible and powerful than others. The differences have an impact not only on ease of use and reporting capability, but on the validity of the results.


The operational basis for link checkers is fairly straightforward. Much like a browser, a link checker connects to Web sites using http (hypertext transfer protocol) and gathers data from the server. However, instead of rendering the data into a viewable Web page, it extracts status information that is part of every http exchange, but not always displayed to the end user. Most everyone has seen a message reading "404 Not Found" when trying to access a URL that's no longer available. The 404 is one of dozens of http status codes defined in the protocol. A "successful" http transaction usually produces a status code such as 200 (meaning "OK") or perhaps 301 or 302 (for a redirect to a new location). Status codes for properly completed http transactions are not normally displayed to users.
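As a rough illustration of the status information described above, the sketch below groups http status codes into the broad classes a link checker works with. It uses only Python's standard library and makes no network requests; the function name and category labels are our own and are not taken from any particular link checker.

```python
from http import HTTPStatus


def describe_status(code):
    """Return (category, reason phrase) for an http status code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Code is not one of the registered statuses known to the library.
        phrase = "Unknown"
    if 100 <= code < 200:
        category = "informational"
    elif 200 <= code < 300:
        category = "success"
    elif 300 <= code < 400:
        category = "redirect"
    elif 400 <= code < 500:
        category = "client error"
    elif 500 <= code < 600:
        category = "server error"
    else:
        category = "other"
    return category, phrase
```

A simple checker would count "success" as good and the two error classes as bad. How it treats the "redirect" class is where products differ most.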

How accurate a picture of a Web site's link status does a simple compilation of http status codes provide? That depends a lot on one's definition of a "good" or "bad" link. For our purposes, we defined a good link as one that leads to (roughly) the content intended by the original reference, either directly or through an automated redirection or refresh. We even accepted situations where one additional manual click would be required to find the original content, as long as the path to that material was fairly evident on the first page brought up.

Unfortunately, even given this fairly liberal definition of "good," most link checkers relying on simple status codes will significantly overestimate the number of good links on a site. There are many situations in which a site that appears good to a link checker may in fact fall short of our definition of good.
  • Status code does not equal content. Just because a URL is still good doesn't mean the content hasn't changed dramatically. The domain name registration may have lapsed and been purchased by another entity. There may be no connection whatsoever with the original content or content provider.
  • Putting out mixed messages. While displaying html content that clearly says "bad," the status code sent out via http says "good." This happens most frequently with sites that have substituted a custom error message for missing pages and failed to associate the proper status code with it. It's obvious on manual examination that these pages are bad, but without special effort, a link checker won't detect it. Depending on how they obtain their status codes, different link checkers may vary in their reporting on such sites.
  • Moved or lost in transit? There are various ways for sites to indicate that content has moved to a new location. One method (called a meta-refresh) is usually reported as good as long as the page with the "we've moved" message loaded correctly. There are also different kinds of redirects. Redirects produce their own status codes, so at least one is alerted that something may be amiss, but a link checker may assume that as long as the redirect succeeds, the page is good. Frequently this is not the case. Also, since assignment of redirect codes requires manual intervention by site operators, it is prone to misapplication.
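Two of the situations listed above, custom error pages that return a "good" status and pages that rely on a meta-refresh, can be flagged for manual review with simple heuristics. The sketch below is one hedged way to do that; it is not how any specific link checker works. The phrase list is hypothetical and would need tuning, and the function and constant names are illustrative.

```python
import re

# Hypothetical phrases that often appear on custom error pages.
ERROR_PHRASES = (
    "page not found",
    "file not found",
    "no longer available",
    "error 404",
)

# Matches a META tag whose http-equiv attribute is "refresh".
META_REFRESH = re.compile(
    r"""<meta[^>]+http-equiv\s*=\s*["']?refresh["']?[^>]*>""",
    re.IGNORECASE,
)


def needs_manual_review(status_code, html_text):
    """List the reasons a link that looks reachable deserves a human look."""
    reasons = []
    if status_code in (301, 302):
        # The redirect succeeded, but the target may not be the intended content.
        reasons.append("redirect")
    if status_code == 200:
        lowered = html_text.lower()
        if any(phrase in lowered for phrase in ERROR_PHRASES):
            reasons.append("possible custom error page")
        if META_REFRESH.search(html_text):
            reasons.append("meta-refresh")
    return reasons
```

The first situation in the list, a live URL whose content or owner has changed entirely, has no comparably simple test; it still calls for a person to compare the page with the original reference.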
The figures in the graph above were not taken at face value from a link checker. We did some manual checking of the results in order to verify their accuracy. We manually checked all reported codes for permanent or temporary redirections. We found that 25% of the permanent redirections and nearly 30% of the temporary redirections did not pass our definition of good. We also checked certain other reported server problems. We adjusted the results in the graph to reflect these findings.

We did not manually check all the links reported as good by the link checker, but we did test a subset, just to get an idea of the error rate. Of the 68 links reported good in the three issues of RLG DigiNews for 1997, 14 (a little over 20%) did not pass muster. This discrepancy is not accounted for in the graph above, so, by our definition, the true percentages of bad links in RLG DigiNews are higher than the graph shows.
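The size of that correction can be estimated with a little arithmetic. The sketch below applies the error rate found in the 1997 sample (14 of 68) to a reported share of bad links. Treating the "about 40%" figure for 1997 as exactly 0.40 is an approximation made here for illustration, and the function names are our own.

```python
def false_good_rate(failed_on_inspection, reported_good):
    """Share of links reported good that failed manual inspection."""
    return failed_on_inspection / reported_good


def adjusted_bad_share(reported_bad_share, rate):
    """Add the misreported portion of the 'good' links to the bad share."""
    reported_good_share = 1.0 - reported_bad_share
    return reported_bad_share + reported_good_share * rate


# Figures from the 1997 sample described in the text.
rate_1997 = false_good_rate(14, 68)

# Approximation: "about 40%" bad links in 1997 is treated as 0.40.
estimate_1997 = adjusted_bad_share(0.40, rate_1997)
```

On these assumptions the 1997 estimate rises from about 40% to a little over half of all links. The same rate should not be applied to later years without sampling them as well, since the error rate may differ for newer links.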

Despite these limitations, however, link checkers can form an essential part of a Web maintenance program, whether for your own site, or an external site that contains resources important to you. But it is important to understand that link check reports cannot necessarily be taken at face value. More expensive link checkers provide configuration and customization options that can help produce more accurate status assessments. Given the abundance of available link checkers, our advice is to take advantage of freeware as well as trial versions of shareware and commercial products to find the product that meets your needs.

Peter Botticelli, Robin Dale, Carla DeMello, Barbara Berger Eden, Richard Entlich, Anne R. Kenney, and Nancy McGovern

Publishing Information

RLG DigiNews (ISSN 1093-5371) is a newsletter conceived by the members of the Research Libraries Group's PRESERV community. Funded in part by the Council on Library and Information Resources (CLIR) 1998-2000, it is available internationally via the RLG PRESERV Web site (http://www.rlg.org/preserv/). It will be published six times in 2001. Materials contained in RLG DigiNews are subject to copyright and other proprietary rights. Permission is hereby given for the material in RLG DigiNews to be used for research purposes or private study. RLG asks that you observe the following conditions: Please cite the individual author and RLG DigiNews (please cite URL of the article) when using the material; please contact Jennifer Hartzell, RLG Corporate Communications, when citing RLG DigiNews.


Any use other than for research or private study of these materials requires prior written authorization from RLG, Inc. and/or the author of the article.


RLG DigiNews
is produced for the Research Libraries Group, Inc. (RLG) by the staff of the Department of Preservation and Conservation, Cornell University Library. Co-Editors, Anne R. Kenney and Nancy Y. McGovern; Production Editor, Barbara Berger Eden; Associate Editor, Robin Dale (RLG); Technical Researchers, Richard Entlich and Peter Botticelli; Technical Coordinator, Carla DeMello.


All links in this issue were confirmed accurate as of April 12, 2002.


Please send your comments and questions to preservation@cornell.edu.
