Editors' Interview
The Internet Archive
Brewster Kahle
The Internet Archive
brewster@archive.org
Editors' Note
The editors interviewed Brewster Kahle by phone on May 15, 2002. Here
is an edited version of the interview. Brewster Kahle is the founder
and director of the Internet Archive;
co-founder of Alexa Internet,
an Internet-focused company that concentrates on Web navigation tools
and techniques; and inventor and founder of Wide Area Information
Servers, Inc. The Internet Archive launched the Wayback Machine, a
Web site that provides an interface to the Internet Archive collections,
in October 2001.

General Operations
You launched the Wayback Machine a little
more than six months ago. How has the actual response compared
to your expectations?
We have gotten more usage than I
thought we would. We get 20,000 different users each day. There
are now parts of the Web that link to the past. What does it mean
to have the past as part of the present? The Wayback Machine injects
the past into the present. The Web is a self-documenting, self-cataloging
machine. Unlike books, with the Web, the catalog and content are
part of the same thing, but not completely.
We were first on the Yahoo
Internet Life list of top 100 Web sites for 2001. Der Spiegel
named us "Hit of the Year," and we received the 3rd Digital
Archive Award in Kyoto, Japan. If you look at use by country, the
greatest number of hits comes from Japan. We've hit a nerve. People
care about their own history. They're psyched about it. On the Web,
anyone can be a publisher, and now there is a library for their work.
Also, the site has been used in more
situations than expected. It has been interesting all around. For
example, a group of Masters students at the Berkeley School of Information
Management and Systems this spring presented the results of a series
of interviews on why people use the Wayback Machine. They discovered
that looking up personal Web pages is a major use. "Did I make
it into history?" people wonder. After that, organizations
look for their sites. I thought people would poke around, have some
fun. It provides a reference service.
What kind of organizational infrastructure
does it take to run the Internet Archive and the Wayback Machine?
There are currently seven full-time
employees and 14 interns at the Internet Archive. It would be an
understatement to say that we did everything with that number. We
contract with many organizations. Alexa Internet has thirty people,
but not all of them are working on the Internet Archive. Compaq
actively supports us. It's a ten million dollar per year operation
to do the tasks we're currently doing.
Content and Capture
Do the rate of growth and sheer
size of the Internet Archive present special technological problems?
Oh, yeah. The dataset we have is
a queryable Web collection, which makes it one of the largest databases
ever made. There are up to 200 queries per second. That combination,
we believe, is unprecedented. Our claim to fame is our level of
frugality. We don't have a lot of money.
We need to be willing to use innovative
tools and techniques, to try the very newest ideas. The classic
technology approach would be to buy a database and implement RAID
(Redundant Array of Independent Disks). For 1 terabyte (TB)
you'd be looking at $400,000 for EMC disk storage, $150,000 for
a Sun box, and $100,000 for an Oracle database. Effectively, that
would be almost a million dollars for 1 TB. Like Google and Hotmail,
we use a large number of PCs in parallel. Our technology has to
be state-of-the-art to keep up. We collect 10 TB per month. Our
approach has to be cost-effective and flexible. For instance, we
moved off tape to removable hard drives.
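The cost figures in this answer can be checked with a little arithmetic. The sketch below only adds up the three numbers quoted above and scales them by the stated intake of 10 TB per month. The function names and the assumption that cost scales linearly per terabyte are illustrative, not from the interview. Note that the three itemized figures sum to $650,000; the interview does not itemize the gap between that and "almost a million dollars."

```python
# Figures quoted in the interview (US dollars, for roughly 1 TB of storage).
QUOTED_COSTS_PER_TB = {
    "EMC disk storage": 400_000,
    "Sun box": 150_000,
    "Oracle database": 100_000,
}

# "We collect 10 TB per month."
MONTHLY_INTAKE_TB = 10


def itemized_cost_per_tb(costs=QUOTED_COSTS_PER_TB):
    """Sum of the three line items quoted for the classic approach."""
    return sum(costs.values())


def projected_cost(months, tb_per_month=MONTHLY_INTAKE_TB):
    """Cost of storing `months` of intake the classic way.

    Assumption (not from the interview): cost scales linearly per TB.
    """
    return months * tb_per_month * itemized_cost_per_tb()


if __name__ == "__main__":
    print(f"Itemized cost per TB: ${itemized_cost_per_tb():,}")
    print(f"One year of intake:   ${projected_cost(12):,}")
```

Under these assumptions, a single year of intake would cost several times the $10 million annual budget mentioned earlier, which is the point of the parallel-PC approach.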
You have a number of special
collections on your site, such as Election 1996. Do you have an ad
hoc approach as subjects of interest surface or do you have a collection
plan? The Internet Archive FAQ encourages donations of collections.
Will your special collections program expand?
All our special collections have
been done in partnerships. Someone else helps with the curatorship,
selection and quality assessment. The 1996 Election collection became
a kiosk in the Smithsonian. They thought it possible that the Web
might be the next "bumper sticker" in campaigning. The
Election collection grew into a big project with the Library of
Congress for 2000. There were lots of questions: "could
you?" "should you?" "are you allowed to do this?"
We see special collections growing through partnerships into national
collections. It is inconceivable for countries not to record their
digital heritage. A lot of history is born digital. This should
not be like early television where there is not a record.
Your policy is to collect Web
pages that are publicly available. Do you have any estimates of how
much of the Web is inaccessible to your crawlers?
The Web is effectively infinite.
If you start crawling, you'd never end. How big is our collection?
Enormous, really big. We want to increase access (researchers'
access) and have them help guide the collection. We have proven
you can do something and it can be useful. We already have some
of the deep Web, where the crawl can generate each page for
capture. What we want now is complete collection of important things.
We want more critiques of collections from researchers, historians
and scholars, like "it would have been great if ..."
These comments will inform the collections.
The Internet Archive FAQ indicates
that the addition of a robot exclusion to a Web site will "lead
to the removal of the pages from the Wayback Machine." What amount
of content has been removed as a result of this policy? How is the
removal documented?
We check for robot exclusions and
retroactively exclude content for those sites. There are some classes
of people that have opted out. Many newspapers do not want to be
included. Individuals are all over the map. Photographers feel disadvantaged;
they don't feel that being part of it will help them. This represents
a very small percentage. We do not document the removal of content,
but we collect robot exclusions. At first, we didn't collect them.
Ben Edelman,
a researcher at Harvard Law School, is looking at the relationship
between domain name ownership and robot exclusions.
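A robot exclusion check of the kind described in this answer can be sketched with Python's standard `urllib.robotparser` module. This is an illustration, not the Internet Archive's code. The crawler name `ia_archiver`, the sample robots.txt, the example URLs, and the helper names `is_excluded` and `visible_captures` are all assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that opts a site out for one crawler only.
SAMPLE_ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def is_excluded(robots_txt, user_agent, url):
    """Return True if `robots_txt` forbids `user_agent` from fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


def visible_captures(captured_urls, robots_txt, user_agent="ia_archiver"):
    """Apply the current exclusion retroactively to already-captured URLs."""
    return [
        url for url in captured_urls
        if not is_excluded(robots_txt, user_agent, url)
    ]


if __name__ == "__main__":
    captures = ["http://example.com/", "http://example.com/photos/1.html"]
    print(visible_captures(captures, SAMPLE_ROBOTS_TXT))
```

With the sample file, every previously captured page of the site is filtered out for the named crawler, while a crawler with a different name would be unaffected.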
Web Crawlers and Access
As of October 2001, the Internet Archive offered
access to roughly 10 billion pages. How long does it take to crawl the
entire Web? Do you give any preference to crawling particular portions
of the Internet more often or more thoroughly? If so, how?
There are more pages now, but we do not have a more
recent number. The Internet Archive receives most of its Web pages
from donations from Alexa Internet. They do a new crawl every two
months. We crawl for special collections weekly or daily. If we discover
a new page, we crawl it within 24 hours. We do a complete sweep every
two months.
A powerful Web crawler must
be essential to the development of the Internet Archive. Could you
describe how your Web crawler support has changed in response to the
evolution of this technology since you began the Internet Archive
in 1996?
The crawler technology is constantly
being rewritten. Every 12-18 months Alexa throws out the existing
crawler and starts over from scratch. Revisions to the crawler reflect
changes in the structure of the net, its size, the underlying technologies,
etc. What we have now is the best available crawler.
In searching for specific pages/sites
on the Wayback Machine, we've noticed varying gaps in content, sometimes
within a page, e.g., missing images, sometimes within a site, e.g.,
missing pages, and sometimes over time, e.g., long gaps between capture
dates. What strategies have you developed for crawling to maximize
capture and to validate the content that is captured, or do you use
a passive capture approach?
All of the laundry from the past
is shown to everybody in this collection! Some gaps were the Internet
Archive's fault and some were not. Some missing images are due to
robot exclusions. The technology has changed. From 1996-1998, the
crawler crawled a full Web site or as many pages as it wanted in
one day, so there'd be a clean copy. Other times, it might follow
up later, many days later. Different crawl philosophies were
used. The 1999 crawls do not contain a lot of images because we
did not have enough bandwidth for text plus images. There were months
when there was no crawling at all while the crawler was being rewritten.
Your policy is to provide "free
access to the Internet collections to researchers, historians, and
scholars through an account on a Unix server." Do you allow Web
crawlers to access the Internet Archive? Does the Wayback Machine
provide access to all of the Internet Archive content?
Our terms of service are that users
respect other people's privacy and copyright. You could interpret
that as "don't copy from the Wayback Machine or the Unix archives
collections without talking to the Internet Archive." We have
robot exclusions on the Internet Archive collections. We do make
case-by-case exceptions.
Does the Internet Archive support
the Open Archive Metadata Harvesting Protocol?
We haven't supported OAI
(Open Archives Initiative) yet, but we want to. We work with users.
We want to be as standards-based as possible as long as people would
use it. We are willing to be an OAI data provider or service provider.
Digital Archives: Technical Issues
We have a series of preservation-related questions
for you. First, has the Open Archival Information System (OAIS) reference
model influenced the development of the Internet Archive? If not, is
the Internet Archive based upon a digital archive model?
Your FAQ on storage and preservation
indicates that you will migrate your storage media but use emulation
for your file formats. Did you consider file migration? Are you exploring
other options? Are you working on any research in these areas?
Once archived, we never change a
page. The Web wasn't constructed to be archived. It's so interconnected.
A book exists outside of time. Archiving Web sites is like putting
together a bomb after it has exploded. We do as good a job as we
know how. If anyone knows how to improve it, please let us know
or help us change it. We can't force people to archive in a certain
format. The Wayback Machine is the best way we know to look at the
results.
File format changes haven't become
a problem for us yet. Most old HTML is still supported. Sometimes
we copy files forward. We store things in the exact way they were
gotten. We are very careful at ingest to record all metadata. When
using the Wayback Machine the links on a page are changed on the
fly to point into the past. You need to watch the URL when you're
surfing the Wayback Machine to answer the question, "When
was this page archived?" We tried putting a frame around the
page, but that is technically hard to do. Users, we are afraid,
forget to look at the URL, but we don't know a better way.
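To illustrate what "watching the URL" means, here is a minimal sketch. It assumes archived pages are addressed as `http://web.archive.org/web/<14-digit timestamp>/<original URL>`; that layout, the example URLs, and the helper names are assumptions for the example, not details given in the interview. The first function reads the capture date out of such a URL, and the second shows the kind of on-the-fly link rewriting described above.

```python
import re
from datetime import datetime

# Assumed layout: .../web/YYYYMMDDhhmmss/<original URL>
ARCHIVED_URL = re.compile(r"/web/(\d{14})/(.+)$")


def capture_info(archived_url):
    """Return (capture datetime, original URL) for an archived-page URL."""
    match = ARCHIVED_URL.search(archived_url)
    if match is None:
        raise ValueError(f"not an archived-page URL: {archived_url!r}")
    stamp, original = match.groups()
    return datetime.strptime(stamp, "%Y%m%d%H%M%S"), original


def rewrite_link(original_url, stamp):
    """Point a link found on an archived page at the capture nearest `stamp`."""
    return f"http://web.archive.org/web/{stamp}/{original_url}"


if __name__ == "__main__":
    url = rewrite_link("http://example.com/news.html", "20000815120000")
    when, original = capture_info(url)
    print(f"{original} was archived on {when:%B %d, %Y}")
```

A link that is not rewritten this way keeps pointing at the live site, which is how a page from one date can lead to content from another.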

Figure 1. JFK Library Web page
captured August 15, 2000 points readers to news releases from 2002
via the "What's New" link because the frame is from the
earlier date and the news page is from the current date. The frame
is using referential linking to the current page.
Your preservation FAQ references the ARC format proposed by
Alexa Internet as an Internet object archiving standard. Is Alexa
trying to make this a formal standard, and if so, through which standards
body?
We haven't approached a standards
body at this time. It's a lot of work. I was in charge of an IETF
group. I found that a couple of people working together works best.
Put something together and throw it out there. If people pick it
up, it's worth going through the headaches. Some others are doing
Web archiving, but with their own tools. We have started active
outreach to other projects to see if we can share tools, formats,
and experiences.
We understand that you are trying
to maintain multiple copies of the entire set of collections. Are
these mirror sites? Does the amount of content present special technical
issues for establishing redundant storage?
Mirrors? Yes. We donated a complete
copy of the collection to the new Library of Alexandria. I spent
April 2002 in Alexandria, Egypt. It was a kick to see the machine
it's on; it's front and center when you walk in. The best method
for preservation is replication. Put copies in as many different
contexts as possible: one in RAID, bury a copy. Put a copy in
someone else's hands. They will care for it differently, and take
care of some bugs. Usually books are destroyed because the new regime
is not interested in the old. We learned a lesson from Alexandria;
we want to place duplicates on other continents. Redundancy is antithetical
to current culture. Libraries often strive to create unique collections
that are proprietary. We are less proud of collected materials if
there are other copies. We need to show positive examples to change
the way things work.
Digital Archives: Financial Issues
Your Web site states that the Internet Archive
was founded to "build an 'Internet library' with the purpose of
offering permanent access" to the content. Have you established
a sustainable funding model for the Internet Archive? Is there a business
continuity plan that will ensure ongoing access to the collections?
Yes, we have a sustainable funding
model. Everything that has been donated is safe. There is enough
money to make sure that it's safe forever. Considering Moore's
law, what we have is sustainable for the present, but access
is more expensive. We need to blend with the research community.
We need to put our hand out for support to further the work on access.
We noticed the monetary donation
button on the Internet Archive site via Amazon.com. Is this similar
to NPR's pledge drives? What kind
of response have you gotten? Do you use those funds for specific purposes?
There has been almost no response
to pledge unless you badger. The real money has come from private
foundations, government support, and in-kind donations from corporations: Zoe
Baird and the Markle Foundation,
NSF, and LC. We need to expand the list by making this part of people's
lives, to open new doors.
Legal Issues
Can you discuss your recent brief for the Supreme
Court case, Eldred v. Ashcroft? Do you anticipate the Internet Archive
becoming more active in such cases?
We are not bringing the cases. People are using
the Internet Archive as an example of how they see the future. Peter
Lyman suggested in his paper "Why Archive the Web?" that
the Internet is the information resource of first resort for millions
of readers. Some say the good stuff is not on the Net. I have a
sinking feeling that if the young are learning from the Web, and
the Web is not offering the best, that we are shortchanging our
youth. This is criminal. It makes no sense. The expectation that
materials for learning should be on the Web brings urgency to the
issue. We, the establishment (face it, we are), are screwing
up. We do not need more meetings on metadata. We need to get our
cultural heritage easily accessible on the Web. Newspapers are up
there. Academic journals are becoming available. Books, music, videos,
and TV are trying but they are suing their way into the next century, or
the last one. That all this stuff ends up in the courts is dumb;
that approach doesn't make sense. Raj Reddy promotes "universal
access to human knowledge." Some for free and some for pay.
If we take as our premise that all information should be at our
fingertips, how do we get there? The public library system is a
$25 billion industry. $5-6 billion goes to publishers. If students
aren't consulting with the Library of Congress daily then let's
get on with it. If we wait for 20 years, we could shortchange a
human generation. Kids could grow up without the best works of the
21st century. Our libraries contain large amounts of pre-20th century
materials which could all be online now because there are no rights
issues. Raj Reddy says we are moving too slowly. We are trying to
do the million
books project. A million books for free, great. India and China
are in, great. The US, not really (Cornell, Math); these
are dribs and drabs. Kids are demanding a different way. RLG and
the research libraries: these are rich institutions and they
can be doing much more. The technology is not hard. Raj Reddy says
it takes 2-4 hours to scan a book. Not the most interesting 2-4
hours, but then that book can be offered to the world. Gutenberg
showed how it works. We need to do our job.
Future Plans
How do you measure success for the Internet
Archive? What are your plans? Do you have specific goals that you
would like to achieve?
We want to grow our collections, but grow them in
ways that they are useful to traditional library users: researchers,
scholars, and the underserved. On the Web we can put tools and technology
on top of collections: a search engine to answer harder questions.
We can bring tools and information together in new ways that weren't
possible before the Web. We think we have an archive; we want to
build a digital library. We need partnerships to do this. We have
collections but not a lot of finding aids. The top-level best thing
for the community is universal access to human knowledge. It is
within our grasp. We need to coordinate our efforts and just do
it. I leap out of bed in the morning fueled by this possibility.
We have the ability to touch people, like poor kids in Yemen.
When the goal sinks in, every day is surprising. It straightens
the road; it's not curvier. We can leave a legacy which will make
our grandkids proud of us. This is not fanciful. We have technology
for storing, for accessing. We have the political will to live in
an open society. Libraries are now stepping up to the challenge.

Publishing Information
RLG DigiNews
(ISSN 1093-5371) is a newsletter conceived by the members of the Research
Libraries Group's PRESERV community. Funded in part by the Council on
Library and Information Resources (CLIR) 1998-2000, it is available internationally
via the RLG PRESERV
Web site. It will be published six times in 2002. Materials contained
in RLG DigiNews are subject to copyright and other proprietary
rights. Permission is hereby given for the material in RLG DigiNews
to be used for research purposes or private study. RLG asks that you observe
the following conditions: Please cite the individual author and RLG
DigiNews (please cite URL of the article) when using the material;
please contact Jennifer Hartzell,
RLG Corporate Communications, when citing RLG DigiNews.
Any use other than for research or private study of these materials requires
prior written authorization from RLG, Inc. and/or the author of the article.
RLG DigiNews is produced for the Research Libraries Group,
Inc. (RLG) by the staff of the Department of Preservation and Conservation,
Cornell University Library. Co-Editors, Anne R. Kenney and Nancy Y. McGovern;
Production Editor, Barbara Berger Eden; Associate Editor, Robin Dale (RLG);
Technical Researchers, Richard Entlich and Peter Botticelli; Technical
Coordinator, Carla DeMello; Technical Assistant, Kimberly Gazzo.
All links in this issue were confirmed accurate as of June
10, 2002.
Please send
your comments and questions to preservation@cornell.edu.
