RLG DigiNews: Taking Stock at Five Years
Cornell
Editorial Staff
preservation@cornell.edu
Introduction
Five years ago, RLG published the first issue of RLG DigiNews.
A lot has changed since then, and a good bit has remained the same.
We're using this anniversary issue as a case study to reflect on those
changes. This feature article discusses key turning points for RLG
DigiNews from the access and preservation perspectives. Our FAQ asks
"where are they now" as it follows up on two projects that were
announced in the first issue. In the June 2002 issue, we'll report on
several more. The fate of these projects, like the other changes that
the editorial staff of RLG DigiNews has witnessed, reveals
both the opportunities and the obstacles that line the shores of a
swiftly moving technological sea.
RLG DigiNews had its roots in an RLG electronic group-based document,
"Diginotes," compiled by members of PRESERV as a way to keep
pace with the rapidly developing field of digitization. In the two "issues"
distributed via email to a special RLG discussion list, "Diginotes"
contained announcements on, and citations to, "library imaging technology
and applications." Though "Diginotes" ceased after two
compilations, the need for timely information on the topic of digitization
did not.
Responding to member requests for assistance and information, a new Web-based
document (RLG DigiNews) was born in early 1997. RLG joined forces
with the staff of the Cornell University Library Department of Preservation
and Conservation to provide a "substantive, informative, and timely
response to the expressed desire of preservation specialists for an easy-to-understand,
broadly conceived information stream on selected worldwide efforts in
the converging fields of preservation and digitization." The editors
promised to capitalize on the "enhanced functionality of a Web-based
publication, by providing hot links to featured documents and enhanced
searching capability." The publication has matured considerably in
the past five years, as reflected in the changing masthead (figure 1)
and the list of publishing milestones presented below.
Figure 1: The masthead has evolved to reflect the changing look of the
Web
Publishing Milestones
- RLG
DigiNews began as a quarterly newsletter, but became a bimonthly
publication in its second year. The Council on Library and Information
Resources provided initial support for the two extra issues each year.
- One FAQ
and at least one Highlighted Web Site (HWS) have appeared in every issue.
In recognition of the increasing popularity of FAQs, we moved this feature
beginning with the August 1999 issue from its less obvious location
in the midst of news and announcements to a more prominent place after
the HWS.
- Technical
Reviews by editorial staff appeared only in Volume 1. These were replaced
by periodic Technical Features, for the most part written by external
authors.
- The first
Conference Report appeared in the December 1999 issue, reflecting the
increasing importance and regularity of key meetings on digital imaging
and preservation.
- Beginning
in 2000, each issue included coverage of some aspect of digital preservation.
Relevant articles and items were flagged by a new
icon that incorporated the infinity symbol typically associated with
preservation, e.g., denoting the use of permanent/durable paper.
- The Editor's
Interview, introduced in August 2000, has provided an opportunity for
focused discussions with key people on current hot topics.
- For the
first issue, searching was limited to use of a Web browser's "Find"
or "Search" function, but by the second issue, viewers could
browse the tables of contents or use keyword searching. Author and title
indexes were subsequently added.
Access and Use
The
initial intended audience for RLG DigiNews included "managers
of digital initiatives with a preservation component or rationale."
Since then, the reader base has grown dramatically, with the number of
hits more than tripling from 1997 to 2000, from just over 20,000 hits
to over 70,000. RLG reports that this publication is one of its most popular
electronic resources. Lars
Aronsson's Telecom History Timeline mentions the founding of RLG
DigiNews as a historic event in 1997. The publication meets the Americans
with Disabilities Act requirements for accessibility.
Each issue attracts thousands of readers on five continents. Back issues
have a long shelf life. Usage of many early issues has not diminished
substantially over time, and in some cases has increased. The October
15, 1998 issue, for instance, had more hits in 2001 than it did in 1999.
Some issues remain significantly more attractive to users than others.
The two most popular issues of 1999 featured lead articles on digital
imaging and preservation microfilm (February) and digitization costs (October),
proof that both topics continue to spark interest.
A Google index search identified over 1,000 links from various Web sites
to RLG DigiNews. The publication shows up frequently on resource
pages of consultants and faculty for digital preservation, digital imaging,
and library conservation. Library portals around the world, including
those in Australia, Canada, China, Europe, Israel, New Zealand, South
Africa, South America and the United Kingdom, link to the publication.
Twenty-one features have been highlighted in Current
Cites, which each month selects "only the best items to annotate"
from the current literature in information technology in print and digital
form. Preservation "safekeeping" arrangements for many RLG
DigiNews articles have been made with Australia's PADI,
a subject gateway to select digital preservation resources (see section 9 below).

Figure 2: A
map showing the world readership of RLG DigiNews. The majority of
readers come from North America (74%) and Europe (19.5%).

Figure
3: This chart shows the number and percentage of visitors coming from various
Web domains, such as .COM, .NET, .ORG, and .EDU, for a three-month period.

Figure
4: Tracking First Time Visitors for three months reveals a peak in readership
at the point of publication.
Preservation
While the access statistics for RLG DigiNews are gratifying, we were
also interested in determining the health of the journal itself, especially
since so many of the back issues continue to receive high use. The five-year
anniversary offered a convenient milestone for reviewing our plans for long-term
access to the content. Cornell staff prepared a list of key preservation
considerations as a basis for self-examination and for identifying potential
risk factors for the publication. Using these as a guide, the editorial
staff at Cornell and RLG sought to assess the publication's preservation
readiness.
1. Organizational
commitment
- What
is RLG's commitment to maintain and continue the publication?
- What
is the funding stream; for how long is it secure?
- Does
RLG have a preservation strategy/plan in place?
Robin Dale,
Program Officer, Member Programs and Initiatives, Research Libraries Group:
RLG intends to maintain and continue the publication for as long as
it continues to be a valuable resource to the community. The funding stream
is a line item within the yearly budget and is continued from year-to-year.
It is anticipated that this stream will remain in place for as long as RLG
continues the publication. Regarding a preservation strategy/plan, we have
several procedures in place. I make regular backup copies of the material
to at least two different media, each of which is stored in a different
physical location. The copies are regularly refreshed. This is in addition
to the technical infrastructure and back-up described in question 2 below
and the third-party archival arrangements described in question 9. Finally,
RLG DigiNews issues, along with other selected RLG publications,
are part of a testbed digital archive project currently underway at RLG.
With all of these strategies in place, we feel comfortable with the security
of RLG DigiNews.
2. Technical
infrastructure
- Where
do the bits reside?
- What
kind of hardware/software/server is used?
- Is
it backed up, 24/7 supported?
Robin Dale:
RLG DigiNews is an integrated part of the RLG corporate web site
content. With a 1.544 Mbps connection to our Genuity internet service provider,
RLG's corporate web server, named Lyra, sits on a 1.5 Mbps LAN. Presently,
the server is a SUN UltraSparc 2, running the Solaris 2.5.1 (Unix) operating
system. This is a single-CPU machine rated at 400 MHz, with 250 MB of internal
memory and 20 GB of external disk storage. It runs 24 x 7 x 365 and is continuously
monitored for availability. The system and data are fully backed up weekly,
using Veritas NetBackup, involving a rotating four-cycle process and off-site
tape storage. The server platform will soon be upgraded in connection with
an overall corporate website make-over project, and will be run on a SUN
220R server, a dual-CPU machine rated at 440 MHz each, with 2 GB of internal
memory and 40 GB of external disk storage. The new Lyra corporate web server
will be running the Sun Solaris 2.8 (Unix) operating system. Trained RLG
and Stanford University technical staffs are on call at all hours to ensure
fast response to any special system needs. Vendor support from SUN is designated
as "Silver", meaning we have contracted for "within four
hours" on-site response for 8 hours daily on weekdays, with 24-hour
support at all times by telephone.
3. Data
Fixity
- What
means are in place to secure the files and protect them from unauthorized
change and use, data corruption, etc?
Robin Dale:
The Lyra server is a bastion host, meaning it must be accessible to the
general Web-using public. Because of this, it is also locked down from an
information security standpoint. Access to other servers is tightly controlled,
as is any ability to use the Lyra server as a "Trojan horse"
to access other sites. All Lyra server changes are made by system administrators
who are permitted differing degrees of capability under user identification
and password protection. The system also runs PERL 5.6 and a Web Indexer.
Other software in support of the RLG corporate web site on the Lyra server
includes Digital Certificate Issuing software for credit card transactions,
a Virtual Web Server for access to detailed customer accounting reports
under customer id and password control, an anonymous FTP server and a POP
Mail server. The system uses the Apache web server software. Upon installation,
the Lyra server was configured with the RLG-customized "security hardening"
kit, which includes software such as YASP, tripwire-like features, and monitoring
software to ensure the security of the server environment. The server platform
and all network connections enjoy the physical security of being located
in Forsythe Hall on the Stanford University campus, having very secure physical
access and operational integrity characteristics.
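The "tripwire-like features" mentioned above rest on a simple idea: record a checksum for every file, then compare later checksums against that baseline to detect unauthorized change or corruption. The following is a minimal sketch of that idea, not a description of RLG's actual tooling. The function names and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file under root (by relative path) to its digest."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_manifests(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Report files changed, removed, or added since the baseline manifest."""
    return {
        "changed": sorted(k for k in old if k in new and old[k] != new[k]),
        "removed": sorted(k for k in old if k not in new),
        "added": sorted(k for k in new if k not in old),
    }
```

In practice the baseline manifest would be stored away from the server it describes (for example, alongside the off-site backup copies), so that an intruder who alters a file cannot also alter its recorded checksum.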
4. Format
stability, reliability, and complexity
- What
formats are used? What versions?
- Do
they adhere to common, open standards?
- Is
the coding correct and the data validated (e.g., HTML validator, parser)?
- Does
the journal rely on experimental approaches (e.g., technology that may
be very short-lived)?
- Is
format control exercised by the editorial staff (e.g., does the staff
do the mark-up or do contributors, does the journal establish and maintain
format requirements)?
- How
complex are the formats in terms of variations, computation, volume?
Cornell editorial
staff: RLG DigiNews currently uses GIF and JPEG for images and
the current version
of HTML for text markup. Current RLG DigiNews manuscript submission
policy expands the acceptable article submission formats from ASCII to include
Microsoft Word and RTF (Rich Text Format). Current versions are used.
We adhere to open common standards. Mark-up by the editorial staff
is consistent and follows established standards. The RLG staff adheres to
established procedures for validating the content before posting each issue.
The journal tracks trends in Web site design and management, but uses technology
that is readily available to avoid inhibiting use or maintenance. The staff
does the mark-up using format requirements that adhere to RLG requirements
and good practice. The formats used are not very complex. The content is
intended for easy use in international settings.
5. Authenticity
and Provenance
- What
is the policy on correcting mistakes? Is the original version maintained
or the changes noted?
The Editorial
Policy for RLG DigiNews is as follows: "Upon discovery and notification
to RLG, the error is corrected and a note is inserted into the text to explain
the reason why the text was corrected, as well as the date of the correction."
6. Redundancy
- Is
the publication mirrored? If so, where and in how many places?
- Is
there a formal agreement for mirroring in the works?
Robin Dale:
The publication is not currently mirrored though we are considering some
possible arrangements. Discussions with specific institutions are in the
preliminary phase, though implementation of any agreement probably won't
take place until at least the end of 2002.
7. Metadata
- Technical
(are the technical approaches well documented, e.g. use of javascript,
the guts of the technical application, dependency on external programs
and scripts, documentation on changes):
Cornell editorial
staff: Yes, the source code includes scripts and these are well-documented.
We document changes in policy and practice.
- Navigation
(nature and extent of descriptive and structural metadata, e.g., SGML,
Table of contents, consistency of approach, etc.):
RLG DigiNews
uses Dublin Core metadata elements and keywords for descriptive metadata.
Each issue contains an embedded table of contents to help users navigate
through the content. The mark-up is consistent from issue to issue, and
changes in the structure and presentation of the content are noted at the
time they are implemented.
- Resource
discovery: How can people find the journal?
- What
search engines and abstracting services pick them up?
- What
practices promote/inhibit resource recovery (e.g., use of metatags)?
- Does
the journal provide indexing/searching features itself?
Each issue of
the journal is announced on major professional electronic mailing lists.
RLG DigiNews is actively promoted by RLG and features prominently
on the RLG Web site. As noted earlier, RLG DigiNews is well represented
on institutional, organizational, and personal web sites devoted to digital
imaging and preservation information. RLG DigiNews uses META tags
for Dublin Core data elements, for keywords, and for high-level content
elements. Consistent and correct formatting also promotes resource discovery.
The RLG DigiNews site provides Author and Title indices, as well
as links to back issues and basic full-text search capability. RLG also
permits crawling of its site by major search engines, including Google,
to facilitate resource discovery by users.
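To make the use of META tags for Dublin Core elements concrete, here is a small sketch that renders a set of elements in the common "DC." naming convention for HTML. It is an illustration only: the actual tags in RLG DigiNews pages are not reproduced in this article, the function name is hypothetical, and the sample values are limited to facts stated elsewhere in this issue.

```python
from html import escape


def dublin_core_meta(elements: dict[str, str]) -> str:
    """Render Dublin Core elements as HTML <meta> tags, one per line."""
    # The schema link declares what the "DC." prefix refers to.
    lines = ['<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">']
    for name, value in elements.items():
        # escape() also escapes quotes, so values are safe inside attributes.
        lines.append(f'<meta name="DC.{escape(name)}" content="{escape(value)}">')
    return "\n".join(lines)


# Sample values taken from this issue's publishing information.
print(dublin_core_meta({
    "title": "RLG DigiNews",
    "publisher": "Research Libraries Group",
    "identifier": "ISSN 1093-5371",
}))
```

Tags of this kind sit in the head of each page, where indexing tools that choose to read them can find them without parsing the article body.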
8. External
dependencies
- Does
the publication use consistent/persistent link naming?
- What's
the nature and extent of dead links?
- Are
links really dead or just moved?
- What
policy does the publication follow, if any, when including external
links, e.g., linking priorities, extent of monitoring and updating?
Cornell editorial
staff: RLG DigiNews does not use any of the persistent link approaches.
Each issue is a single document from which individual articles can be printed,
so each article does not have a unique identifier. The naming of issues
is consistent. RLG DigiNews incorporates many links into every issue,
including the Highlighted Web Site (HWS). All of the back issues of RLG
DigiNews are available on the RLG site and the staff uses link analyzers
to monitor the site, though past links are not corrected if the sites are
moved or removed. For instance, of the 34 HWS in the first five volumes
of RLG DigiNews, 27 are still active and seven have moved. Of those
seven, two have merged into one site, one has linked to a new site through
a redirect, and one link has become corrupted.
The sidebar by Richard Entlich discusses the issue
of link integrity.
9. External,
third party archives
- What
external archives cover the journal and how complete is the coverage;
to what extent could the journal be recreated from these archives?
Robin
Dale: The Internet Archive
can capture pages and sites that might not be saved otherwise and may
be a piece of a retention program but should not be viewed as a substitute
for a digital preservation program for RLG DigiNews. The Internet
Archive holds copies of all issues of RLG DigiNews with the exception
of the last, Volume
6, Number 1, 15 February 2002. This is because the most recent crawl
of the site was on 7 February 2002. Since crawls of the site tend to take
place every 8 to 9 weeks or sometimes longer, I'd imagine that this "missing
issue" will be covered soon. Other than that, the journal could be
recreated.
Cornell Staff Note: The Internet Archive recently launched the
"Wayback Machine," an online tool to search the archive's vast
holdings. When we used the Wayback Machine to search for RLG DigiNews
we discovered some interesting results, as indicated by the table below.
Note that the first copy listed for the April 1997 issue was from May
of that year, while the first copy obtained for the RLG DigiNews
home page was from December 1998, a year-and-a-half after the journal
was first published.
| Page | First Capture |
| RLG DigiNews home page | Dec 2, 1998 |
| April 1997 issue | May 3, 1997 |
| April 1998 issue | Aug 15, 2000 |
| April 1999 issue | Dec 9, 2000 |
| April 2000 issue | Aug 17, 2000 |
| April 2001 issue | Apr 18, 2001 |
Table 1. The Wayback Machine capture dates for RLG DigiNews issues
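The gaps in Table 1 are easier to compare when expressed as a lag in days between publication and first capture. The sketch below does that arithmetic. The capture dates come from the table; the publication date is an assumption, namely that each April issue appeared on the 15th of the month, as other issues cited in this article did.

```python
from datetime import date

# First capture dates for the April issues, copied from Table 1.
FIRST_CAPTURE = {
    1997: date(1997, 5, 3),
    1998: date(2000, 8, 15),
    1999: date(2000, 12, 9),
    2000: date(2000, 8, 17),
    2001: date(2001, 4, 18),
}


def capture_lag_days(issue_year: int, publication_day: int = 15) -> int:
    """Days between an April issue's assumed publication and its first capture."""
    # Assumption: the issue was published on April 15 of its year.
    published = date(issue_year, 4, publication_day)
    return (FIRST_CAPTURE[issue_year] - published).days


for year in sorted(FIRST_CAPTURE):
    print(year, capture_lag_days(year))
```

Under that assumption, the 1997 and 2001 issues were captured within weeks of publication, while the 1998 and 1999 issues waited well over a year.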
Besides the
Internet Archive, the National Library of Australia's Preserving Access
to Digital Information (PADI) initiative has established the Safekept
program. Being selected for the Safekept program provides an opportunity
for organizations to establish, review and/or enhance their Web site preservation
programs. The Safekept program identifies a nine-step process for ensuring
the preservation of the selected Web sites, to which contributing organizations
must adhere. The Safekept program has many, but not all, issues of RLG
DigiNews marked within its databases. This "minimalist"
approach was born out of an agreement with PADI that RLG would provide
for and maintain all of RLG DigiNews in many ways and formats, so
it was not absolutely necessary for NLA to do the same.
10. Look
and Feel
- Is
the old design and any functionality of early issues maintained (e.g.,
interface changes)?
Cornell editorial
staff: Yes. The back issues include the masthead that was in
use when the issue was published. Users can view changes in the journal's
presentation and format. Earlier we noted the various changes and introductions
to the journal.
11. Virtual
content
- How
much of the content is virtual (e.g., created at the point of access,
or generated on-the-fly)? How well can this virtual content be maintained?
Cornell editorial
staff: The content of RLG DigiNews is captured in static
HTML documents. Recent issues contain a script that monitors use of the
site, but there is no virtual content to be maintained.
12. Ability
to retain extended (added-value) services of the journal
- Is
preserving its function as a ready reference database or other information
services supported?
Robin Dale:
The functionality of RLG DigiNews is easy to maintain. The search
capability is upgraded periodically to continue to provide basic searching
using current technology. Features that provide access to the content are
based primarily upon links that support navigation between issues and identifying
topics and authors of interest.
What
do our readers think?
In our
February 2002 issue, we included a readers' survey to help guide our planning
for the next five years of RLG DigiNews. Responses
came from 233 readers, including 66 written comments. Overall, we received
an abundance of valuable information, for which we thank everyone who offered
their time and thoughtful feedback.

We
were pleased to discover that 87 percent of respondents claimed that the
content of RLG DigiNews is "just about right" in its
usual level of technical detail. We found it interesting that two-thirds
of respondents favor digital imaging features over other parts of the
journal. However, respondents were split on the type of content they prefer;
39 percent favor policy recommendations, while 28 percent prefer information
on technical standards and best practices, and 27 percent favor equipment
reviews. Preferences may be changing, as we found that responses from new
readers (those who have discovered the journal within the last two years)
were slightly more numerous than responses from long-time readers.
We were also
interested to learn that 60 percent of respondents discovered RLG DigiNews
through listservs, while only 23 percent learned about it from colleagues
and just 10 percent through other publications and Web sites. This is
no doubt due to posting announcements of the release of new issues on
20 listservs worldwide.
In sum, as we begin our sixth year of publication, the state of RLG
DigiNews appears healthy. Although we can't forecast technological
advances and readers' interests for 2008, we can expect at least as many
changes as the past five years have witnessed. The editorial staff will
maintain flexibility, will periodically take the pulse of our readers,
and will look forward to writing the 10th anniversary article!
Link
Analysis in RLG DigiNews
Richard Entlich
Once
an issue of RLG DigiNews goes "to press" its content
is considered fixed. Subsequent changes are only made to correct significant
errors, and those are always documented within the issue. Although
such a policy guarantees the editorial integrity of the publication,
it also means that, over time, the links to external Web sites will
gradually degrade.
How bad a problem is "link rot?" Pretty bad, as any regular
Web surfer will quickly tell you. Not only does Web site content change
and move around a lot, but domain names lapse and are reassigned on
a regular basis.
We conducted an analysis of links to external Web sites from the first
five years of RLG DigiNews issues. The results are shown below.
When all the issues for each year of publication are averaged together,
there is a nearly linear increase in the annual percentage of bad
links, from about 10% in our most recent year of publication (2001) up to
about 40% in our first year (1997).
There were a total of 1236 unique external links in the 28 issues
of RLG DigiNews published prior to this issue. How did we determine
the validity of so many links? For the most part, we did what most
people would do when faced with such a large task: we took advantage
of automation and employed a software link checker. Writing a link
checker must either be a very popular assignment in computer science
classes or else part of a rite of passage into the world of open source
computing, because there are dozens of link checkers available. Many
are freeware or shareware or free online services. A good list of
available products and services (with a few of its own bad links)
is available at http://www.elsop.com/wrc/comp_ls.htm.
As one might expect with such an abundance of products on the market,
there are some substantial differences in features and performance.
We tested only a few products, but found some considerably more flexible
and powerful than others. The differences have an impact not only
on ease of use and reporting capability, but on the validity of the
results.
The
operational basis for link checkers is fairly straightforward. Much
like a browser, a link checker connects to Web sites using http (hypertext
transfer protocol) and gathers data from the server. However, instead
of rendering the data into a viewable Web page, it extracts status
information that is part of every http exchange, but not always displayed
to the end user. Most everyone has seen a message reading "404
Not Found" when trying to access a URL that's no longer available.
The 404 is one of dozens of http status codes defined in the protocol.
A "successful" http transaction usually produces a status
code such as 200 (meaning "OK") or perhaps 301 or 302 (for
a redirect to a new location). Status codes for properly completed
http transactions are not normally displayed to users.
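The exchange described above can be sketched in a few lines. This is an illustration of the general technique, not the link checker used for this analysis. The function names are hypothetical, and the sketch assumes the target server answers HEAD requests, which not every server does.

```python
import urllib.error
import urllib.request
from typing import Optional


def classify_status(code: int) -> str:
    """Group an http status code into a coarse category for reporting."""
    if 200 <= code < 300:
        return "ok"
    if code in (301, 308):
        return "permanent redirect"
    if code in (302, 303, 307):
        return "temporary redirect"
    if 400 <= code < 500:
        return "client error"  # includes 404 Not Found
    if 500 <= code < 600:
        return "server error"
    return "other"


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Decline to follow redirects so the 3xx code itself is reported."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # urllib then raises HTTPError carrying the 3xx code


def fetch_status(url: str, timeout: float = 10.0) -> Optional[int]:
    """Return the http status code for url, or None if no response arrived."""
    opener = urllib.request.build_opener(_NoRedirect)
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener.open(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # 3xx, 4xx, and 5xx responses arrive here
    except (urllib.error.URLError, OSError, ValueError):
        return None  # DNS failure, refused connection, timeout, bad URL
```

A call such as classify_status(fetch_status(url)) for each URL in an issue yields exactly the kind of raw tally discussed in the following paragraphs, with all of its limitations.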
How accurate a picture of a Web site's link status does a simple compilation
of http status codes provide? That depends a lot on one's definition
of a "good" or "bad" link. For our purposes,
we defined a good link as one that leads to (roughly) the content
intended by the original reference, either directly or through an
automated redirection or refresh. We even accepted situations where
one additional manual click would be required to find the original
content, as long as the path to that material was fairly evident on
the first page brought up.
Unfortunately, even given this fairly liberal definition of "good,"
most link checkers relying on simple status codes will significantly
overestimate the number of good links on a site. There are many situations
in which a site that appears good to a link checker may in fact fall
short of our definition of good.
- Status
code does not equal content.
Just because a URL is still good doesn't mean the content hasn't
changed dramatically. The domain name registration may have lapsed
and been purchased by another entity. There may be no connection
whatsoever with the original content or content provider.
- Putting
out mixed messages. A site may display html content that clearly
says "bad" while the status code it sends out via http says "good."
This happens most frequently with sites that have substituted
a custom error message for missing pages and failed to associate
the proper status code with it. It's obvious on manual examination
that these pages are bad, but without special effort, a link checker
won't detect it. Depending on how they obtain their status codes,
different link checkers may vary in their reporting on such sites.
- Moved
or lost in transit? There are various ways for sites to indicate
that content has moved to a new location. One method (called a
meta-refresh) is usually reported as good as long as the page
with the "we've moved" message loaded correctly. There
are also different kinds of redirects. Redirects produce their
own status codes, so at least one is alerted that something may
be amiss, but a link checker may assume that as long as the redirect
succeeds, the page is good. Frequently this is not the case. Also,
since assignment of redirect codes requires manual intervention
by site operators, it is prone to misapplication.
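The last two situations listed above can be partly screened for by looking at the page body as well as the status code. The sketch below flags pages that deserve a manual look. It is an assumption-laden heuristic, not a feature of any particular product: the phrase list and function name are invented for illustration, and a flag only means "check this by hand." The first situation, where a lapsed domain now serves unrelated content, cannot be caught this way at all.

```python
import re

# Matches a <meta http-equiv="refresh" ...> tag, with or without quotes.
META_REFRESH = re.compile(
    r'<meta[^>]+http-equiv\s*=\s*["\']?refresh', re.IGNORECASE
)

# Hypothetical phrases that often appear on custom error pages.
SOFT_ERROR_PHRASES = (
    "page not found",
    "no longer available",
    "has moved",
    "error 404",
)


def review_flags(status_code: int, body: str) -> list[str]:
    """Return reasons why a link reported by a checker needs manual review."""
    flags = []
    text = body.lower()
    # A "good" status code paired with error wording suggests a custom
    # error page that was never given the proper status code.
    if status_code == 200 and any(p in text for p in SOFT_ERROR_PHRASES):
        flags.append("possible soft error page")
    # A meta-refresh loads successfully even when it only announces a move.
    if META_REFRESH.search(body):
        flags.append("meta-refresh")
    # Redirects succeed technically but may not lead to the original content.
    if status_code in (301, 302, 303, 307, 308):
        flags.append("redirect")
    return flags
```

An empty result does not prove a link is good; it only means none of these particular warning signs were present.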
The
figures in the graph above were not taken at face value from a link
checker. We did some manual checking of the results in order to verify
their accuracy. We manually checked all reported codes for permanent
or temporary redirections. We found that 25% of the permanent redirections
and nearly 30% of the temporary redirections did not pass our definition
of good. We also checked certain other reported server problems. We
adjusted the results in the graph to reflect these findings.
We
did not manually check all the links reported as good by the link
checker, but we did test a subset, just to get an idea of the error
rate. Of the 68 links reported good in the three issues of RLG
DigiNews for 1997, 14 (a little over 20%) did not pass muster.
This discrepancy is not accounted for in the graph above, so, according
to our definitions, the percentages of bad links in RLG DigiNews
it shows should be higher.
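To see roughly how much higher, the sample error rate can be applied to the share of links reported good. The calculation below is a back-of-the-envelope sketch: it uses the "about 40%" figure quoted earlier for 1997 as the reported bad rate, and it assumes the 14-of-68 result holds only for that year. The function name is hypothetical.

```python
def adjusted_bad_rate(reported_bad_rate: float, false_good_rate: float) -> float:
    """Add the share of 'good' links that fail manual review to the bad rate."""
    reported_good_rate = 1.0 - reported_bad_rate
    return reported_bad_rate + reported_good_rate * false_good_rate


# 14 of the 68 links reported good for 1997 failed manual inspection.
false_good_1997 = 14 / 68

# "About 40%" is the approximate 1997 figure cited in this sidebar.
estimate_1997 = adjusted_bad_rate(0.40, false_good_1997)
print(f"{estimate_1997:.0%}")
```

On these assumptions, the share of bad links in the 1997 issues would be a little over half rather than about 40%. We did not sample later years, so no similar adjustment can be offered for them.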
Despite
these limitations, however, link checkers can form an essential part
of a Web maintenance program, whether for your own site, or an external
site that contains resources important to you. But it is important
to understand that link check reports cannot necessarily be taken
at face value. More expensive link checkers provide configuration
and customization options that can help produce more accurate status
assessments. Given the abundance of available link checkers, our advice
is to take advantage of freeware as well as trial versions of shareware
and commercial products to find the product that meets your needs.
Peter
Botticelli, Robin Dale, Carla DeMello, Barbara Berger Eden, Richard Entlich,
Anne R. Kenney, and Nancy McGovern
Publishing
Information
RLG DigiNews
(ISSN 1093-5371) is a newsletter conceived by the members of the Research
Libraries Group's PRESERV community. Funded in part by the Council on
Library and Information Resources (CLIR) 1998-2000, it is available internationally
via the RLG PRESERV
Web site (http://www.rlg.org/preserv/). It will be published six times
in 2001. Materials contained in RLG DigiNews are subject to copyright
and other proprietary rights. Permission is hereby given for the material
in RLG DigiNews to be used for research purposes or private study.
RLG asks that you observe the following conditions: Please cite the individual
author and RLG DigiNews (please cite URL of the article) when using
the material; please contact Jennifer
Hartzell, RLG Corporate Communications, when citing RLG DigiNews.
Any use other than for research or private study of these materials requires
prior written authorization from RLG, Inc. and/or the author of the article.
RLG DigiNews is produced for the Research Libraries Group, Inc. (RLG)
by the staff of the Department of Preservation and Conservation, Cornell
University Library. Co-Editors, Anne R. Kenney and Nancy Y. McGovern;
Production Editor, Barbara Berger Eden; Associate Editor, Robin Dale (RLG);
Technical Researchers, Richard Entlich and Peter Botticelli; Technical
Coordinator, Carla DeMello.
All links in this issue were confirmed accurate as of April 12, 2002.
Please send your comments and questions to preservation@cornell.edu.
