RLG DigiNews
   
  June 15, 2002, Volume 6, Number 3
ISSN 1093-5371


Editors' Interview

The Internet Archive

Brewster Kahle
The Internet Archive
brewster@archive.org

Editors' Note
The editors interviewed Brewster Kahle by phone on May 15, 2002. Here is an edited version of the interview. Brewster Kahle is the founder and director of the Internet Archive; co-founder of Alexa Internet, an Internet-focused company that concentrates on Web navigation tools and techniques; and inventor and founder of Wide Area Information Servers, Inc. The Internet Archive launched the Wayback Machine, a Web site that provides an interface to the Internet Archive collections, in October 2001.

General Operations

You launched the Wayback Machine a little more than six months ago. How has the actual response compared to your expectations?

We have gotten more usage than I thought we would. We get 20,000 different users each day. There are now parts of the Web that link to the past. What does it mean to have the past as part of the present? The Wayback Machine injects the past into the present. The Web is a self-documenting, self-cataloging machine. Unlike books, with the Web, the catalog and content are part of the same thing, but not completely.
We were first on the Yahoo Internet Life list of top 100 Web sites for 2001. Der Spiegel named us "Hit of the Year," and we received the 3rd Digital Archive Award in Kyoto, Japan. If you look at use by country, the greatest number of hits comes from Japan. We've hit a nerve. People care about their own history. They're psyched about it. On the Web, anyone can be a publisher; now there is a library for their work.
Also, the site has been used in more situations than expected. It has been interesting all around. For example, a group of master's students at the Berkeley School of Information Management and Systems this spring presented the results of a series of interviews on why people use the Wayback Machine. They discovered that looking up personal Web pages is a major use. "Did I make it into history?" people wonder. After that, organizations look for their sites. I thought people would poke around, have some fun. It provides a reference service.
What kind of organizational infrastructure does it take to run the Internet Archive and the Wayback Machine?
There are currently seven full-time employees and 14 interns at the Internet Archive. It would be an overstatement to say that we did everything with that number. We contract with many organizations. Alexa Internet has thirty people, but not all of them are working on the Internet Archive. Compaq actively supports us. It's a ten million dollar per year operation to do the tasks we're currently doing.
Content and Capture
Do the rate of growth and sheer size of the Internet Archive present special technological problems?
Oh, yeah. The dataset we have is a queryable Web collection, which makes it one of the largest databases ever made. There are up to 200 queries per second. That combination, we believe, is unprecedented. Our claim to fame is our level of frugality. We don't have a lot of money.
We need to be willing to use innovative tools and techniques, to try the very newest ideas. The classic technology approach would be to buy a database and implement RAID (Redundant Array of Independent Disks). For 1 terabyte (TB) you'd be looking at $400,000 for EMC disk storage, $150,000 for a Sun box, and $100,000 for an Oracle database. Effectively, that would be almost a million dollars for 1 TB. Like Google and Hotmail, we use a large number of PCs in parallel. Our technology has to be state-of-the-art to keep up. We collect 10 TB per month. Our approach has to be cost-effective and flexible. For instance, we moved off tape to removable hard drives.
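The cost figures quoted above can be tallied in a few lines. This is only a back-of-the-envelope sketch in Python: the three prices and the 10 TB per month growth rate come from the interview, while the names `CLASSIC_COSTS_PER_TB` and `classic_cost` and the assumption that cost scales linearly with capacity are illustrative and not from the Internet Archive.

```python
# Per-terabyte prices quoted in the interview (US dollars, 2002).
CLASSIC_COSTS_PER_TB = {
    "EMC disk storage": 400_000,
    "Sun box": 150_000,
    "Oracle database": 100_000,
}

# Collection growth rate quoted in the interview.
TB_PER_MONTH = 10


def classic_cost(terabytes: int) -> int:
    """Itemized cost of the classic approach, assuming linear scaling.

    Linear scaling is a simplifying assumption made for this sketch.
    """
    return terabytes * sum(CLASSIC_COSTS_PER_TB.values())


if __name__ == "__main__":
    print(f"1 TB: ${classic_cost(1):,}")
    print(f"{TB_PER_MONTH} TB (one month of growth): ${classic_cost(TB_PER_MONTH):,}")
```

Under these assumptions the itemized components sum to $650,000 per terabyte; the "almost a million dollars" quoted in the interview presumably allows for costs that are not itemized. At 10 TB per month, a single month of growth would cost $6.5 million at the itemized prices, which illustrates why a frugal operation turns to commodity PCs instead.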
You have a number of special collections on your site, such as Election 1996. Do you have an ad hoc approach as subjects of interest surface or do you have a collection plan? The Internet Archive FAQ encourages donations of collections. Will your special collections program expand?
All our special collections have been done in partnerships. Someone else helps with the curatorship, selection and quality assessment. The 1996 Election collection became a kiosk in the Smithsonian. They thought it possible that the Web might be the next "bumper sticker" in campaigning. The Election collection grew into a big project with the Library of Congress for 2000. There were lots of questions—"could you?" "should you?" "are you allowed to do this?" We see special collections growing through partnerships into national collections. It is inconceivable for countries not to record their digital heritage. A lot of history is born digital. This should not be like early television where there is not a record.
Your policy is to collect Web pages that are publicly available. Do you have any estimates of how much of the Web is inaccessible to your crawlers?
The Web is effectively infinite. If you start crawling, you'd never end. How big is our collection? Enormous—really big. We want to increase access—researchers' access—and have them help guide the collection. We have proven you can do something and it can be useful. We already have some of the deep Web—where the crawl can generate each page for capture. What we want now is a complete collection of important things. We want more critiques of collections from researchers, historians and scholars—like "it would have been great if…" These comments will inform the collections.
The Internet Archive FAQ indicates that the addition of a robot exclusion to a Web site will "lead to the removal of the pages from the Wayback Machine." What amount of content has been removed as a result of this policy? How is the removal documented?
We check for robot exclusions and retroactively exclude content for those sites. There are some classes of people that have opted out. Many newspapers do not want to be included. Individuals are all over the map. Photographers feel disadvantaged; they don't feel that being part of it will help them. This represents a very small percentage. We do not document the removal of content, but we collect robot exclusions. At first, we didn't collect them. Ben Edelman, a researcher at Harvard Law School, is looking at the relationship between domain name ownership and robot exclusions.
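A robot exclusion check of the kind described above might look like the following. This is a minimal sketch using Python's standard `urllib.robotparser` module, not the Internet Archive's actual code; the function name `is_excluded`, the sample robots.txt text, and the use of `ia_archiver` as the crawler's user-agent name are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that a site owner might publish to opt out.
SAMPLE_ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def is_excluded(robots_txt: str, url: str, user_agent: str = "ia_archiver") -> bool:
    """Return True if the robots.txt text forbids user_agent from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "http://example.com/photos/index.html"
    print(is_excluded(SAMPLE_ROBOTS_TXT, page))  # excluded for ia_archiver
    print(is_excluded("", page))                 # an empty robots.txt excludes nothing
```

Retroactive exclusion, as described in the answer, would then amount to running a check like this against a site's current robots.txt before serving any of its older captures.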
Web Crawlers and Access
As of October 2001, the Internet Archive offered access to roughly 10 billion pages. How long does it take to crawl the entire Web? Do you give any preference to crawling particular portions of the Internet more often or more thoroughly? If so, how?

There are more pages now, but we do not have a more recent number. The Internet Archive receives most of its Web pages as donations from Alexa Internet. They do a new crawl every two months. We crawl for special collections weekly or daily. If we discover a new page, we crawl it within 24 hours. We do a complete sweep every two months.

 
A powerful Web crawler must be essential to the development of the Internet Archive. Could you describe how your Web crawler support has changed in response to the evolution of this technology since you began the Internet Archive in 1996?
The crawler technology is constantly being rewritten. Every 12-18 months Alexa throws out the existing crawler and starts over from scratch. Revisions to the crawler reflect changes in the structure of the net, its size, the underlying technologies, etc. What we have now is the best available crawler.
In searching for specific pages/sites on the Wayback Machine, we've noticed varying gaps in content, sometimes within a page, e.g., missing images, sometimes within a site, e.g., missing pages, and sometimes over time, e.g., long gaps between capture dates. What strategies have you developed for crawling to maximize capture and to validate the content that is captured, or do you use a passive capture approach?
All of the laundry from the past is shown to everybody in this collection! Some gaps were the Internet Archive's fault and some were not. Some missing images are due to robot exclusions. The technology has changed. From 1996 to 1998, the crawler crawled a full Web site or as many pages as it wanted in one day, so there'd be a clean copy. Other times, it might follow up later—many days later. Different crawl philosophies were used. The 1999 crawls do not contain a lot of images because we did not have enough bandwidth for text plus images. There were months when there was no crawling at all while the crawler was being rewritten.
Your policy is to provide "free access to the Internet collections to researchers, historians, and scholars through an account on a Unix server." Do you allow Web crawlers to access the Internet Archive? Does the Wayback Machine provide access to all of the Internet Archive content?
Our terms of service are that users respect other people's privacy and copyright. You could interpret that as "don't copy from the Wayback Machine or the Unix archives collections without talking to the Internet Archive." We have robot exclusions on the Internet Archive collections. We do make case-by-case exceptions.
Does the Internet Archive support the Open Archives Initiative Protocol for Metadata Harvesting?
We haven't supported OAI (Open Archives Initiative) yet, but we want to. We work with users. We want to be as standards-based as possible as long as people would use it. We are willing to be an OAI data provider or service provider.
Digital Archives: Technical Issues
We have a series of preservation-related questions for you. First, has the Open Archival Information System (OAIS) reference model influenced the development of the Internet Archive? If not, is the Internet Archive based upon a digital archive model?
I guess not. Should we? We would like to comply with prevailing standards. We are currently reviewing the RLG/OCLC Trusted Digital Repositories white paper.
Your FAQ on storage and preservation indicates that you will migrate your storage media but use emulation for your file formats. Did you consider file migration? Are you exploring other options? Are you working on any research in these areas?
Once archived, we never change a page. The Web wasn't constructed to be archived. It's so interconnected. A book exists outside of time. Archiving Web sites is like putting together a bomb after it has exploded. We do as good a job as we know how. If anyone knows how to improve it, please let us know or help us change it. We can't force people to archive in a certain format. The Wayback Machine is the best way we know to look at the results.
File format changes haven't become a problem for us yet. Most old HTML is still supported. Sometimes we copy files forward. We store things in the exact way they were gotten. We are very careful at ingest to record all metadata. When using the Wayback Machine the links on a page are changed on the fly to point into the past. You need to watch the URL when you're surfing the Wayback Machine to answer the question—"When was this page archived?" We tried putting a frame around the page, but that is technically hard to do. Users, we are afraid, forget to look at the URL, but we don't know a better way.
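The on-the-fly link rewriting described above, and the way the capture date can be read back from the URL, can be sketched as follows. This is an illustrative Python sketch, not the Wayback Machine's implementation: it assumes archive URLs of the form `http://web.archive.org/web/<14-digit timestamp>/<original URL>`, handles only double-quoted `href` attributes, and the function names are hypothetical.

```python
import re
from datetime import datetime
from urllib.parse import urljoin

# Assumed URL layout: prefix, 14-digit timestamp, slash, original URL.
ARCHIVE_PREFIX = "http://web.archive.org/web/"

# Simplification: only double-quoted href attributes are matched.
HREF_RE = re.compile(r'href="([^"]*)"', re.IGNORECASE)


def to_archive_url(timestamp: str, original_url: str) -> str:
    """Build the archive URL for one capture of original_url."""
    return f"{ARCHIVE_PREFIX}{timestamp}/{original_url}"


def rewrite_links(html: str, page_url: str, timestamp: str) -> str:
    """Point every href in html at the archive instead of the live Web."""
    def replace(match: re.Match) -> str:
        absolute = urljoin(page_url, match.group(1))  # resolve relative links
        return f'href="{to_archive_url(timestamp, absolute)}"'
    return HREF_RE.sub(replace, html)


def capture_time(archive_url: str) -> datetime:
    """Answer "When was this page archived?" from the URL alone."""
    match = re.match(re.escape(ARCHIVE_PREFIX) + r"(\d{14})/", archive_url)
    if match is None:
        raise ValueError("URL does not follow the assumed archive layout")
    return datetime.strptime(match.group(1), "%Y%m%d%H%M%S")


if __name__ == "__main__":
    page = '<a href="news.html">What\'s New</a>'
    print(rewrite_links(page, "http://www.example.org/index.html", "20000815000000"))
    print(capture_time("http://web.archive.org/web/20000815000000/http://www.example.org/"))
```

Any link that a rewriter like this misses, for example one built by a script or loaded inside a frame, would still resolve to the live site, so a reader could move from an archived page to a current one without noticing unless they watch the URL.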


Figure 1. JFK Library Web page captured August 15, 2000 points readers to news releases from 2002 via the "What's New" link because the frame is from the earlier date and the news page is from the current date. The frame is using referential linking to the current page.

Your preservation FAQ references the ARC format proposed by Alexa Internet as an Internet object archiving standard. Is Alexa trying to make this a formal standard, and if so, through which standards body?
We haven't approached a standards body at this time. It's a lot of work. I was in charge of an IETF group. I found that a couple of people working together works best. Put something together and throw it out there. If people pick it up, it's worth going through the headaches. Some others are doing Web archiving, but with their own tools. We have started active outreach to other projects to see if we can share tools, formats, and experiences.
We understand that you are trying to maintain multiple copies of the entire set of collections. Are these mirror sites? Does the amount of content present special technical issues for establishing redundant storage?
Mirrors? Yes. We donated a complete copy of the collection to the new Library of Alexandria. I spent April 2002 in Alexandria, Egypt. It was a kick to see the machine it's on—it's front and center when you walk in. The best method for preservation is replication. Put copies in as many different contexts as possible—one in RAID, bury a copy. Put a copy in someone else's hands. They will care for it differently, and take care of some bugs. Usually books are destroyed because the new regime is not interested in the old. We learned a lesson from Alexandria; we want to place duplicates on other continents. Redundancy is antithetical to current culture. Libraries often strive to create unique collections that are proprietary. We are less proud of collected materials if there are other copies. We need to show positive examples to change the way things work.
Digital Archives: Financial Issues
Your Web site states that the Internet Archive was founded to "build an 'Internet library' with the purpose of offering permanent access" to the content. Have you established a sustainable funding model for the Internet Archive? Is there a business continuity plan that will ensure ongoing access to the collections?
Yes, we have a sustainable funding model. Everything that has been donated is safe. There is enough money to make sure that it's safe forever. Considering Moore's law—what we have is sustainable for the present, but access is more expensive. We need to blend with the research community. We need to put our hand out for support to further the work on access.
We noticed the monetary donation button on the Internet Archive site via Amazon.com. Is this similar to NPR's pledge drives? What kind of response have you gotten? Do you use those funds for specific purposes?
There has been almost no response to pledge unless you badger. The real money has come from private foundations, government support, and in-kind donations from corporations—Zoe Baird and the Markle Foundation, NSF, and LC. We need to expand the list by making this part of people's lives, to open new doors.
Legal Issues
Can you discuss your recent brief for the Supreme Court case, Eldred v. Ashcroft? Do you anticipate the Internet Archive becoming more active in such cases?

We are not bringing the cases. People are using the Internet Archive as an example of how they see the future. Peter Lyman suggested in his paper "Why Archive the Web?" that the Internet is the information resource of first resort for millions of readers. Some say the good stuff is not on the Net. I have a sinking feeling that if the young are learning from the Web, and the Web is not offering the best, that we are shortchanging our youth. This is criminal. It makes no sense. The expectation that materials for learning should be on the Web brings urgency to the issue. We—the establishment (face it, we are)—are screwing up. We do not need more meetings on metadata. We need to get our cultural heritage easily accessible on the Web.
Newspapers are up there. Academic journals are becoming available. Books, music, videos, and TV are trying but they are suing their way into the next century—or the last one. That all this stuff ends up in the courts is dumb; that approach doesn't make sense. Raj Reddy promotes "universal access to human knowledge." Some for free and some for pay. If we take as our premise that all information should be at our fingertips, how do we get there?
The public library system is a $25 billion industry. $5-6 billion goes to publishers. If students aren't consulting with the Library of Congress daily then let's get on with it. If we wait for 20 years, we could shortchange a human generation. Kids could grow up without the best works of the 21st century. Our libraries contain large amounts of pre-20th century materials which could all be online now because there are no rights issues. Raj Reddy says we are moving too slowly. We are trying to do the million books project. A million books for free, great. India and China are in, great. The US, not really—Cornell, Math—these are dribs and drabs. Kids are demanding a different way. RLG and the research libraries—these are rich institutions and they can be doing much more. The technology is not hard. Raj Reddy says it takes 2-4 hours to scan a book. Not the most interesting 2-4 hours, but then that book can be offered to the world. Gutenberg showed how it works. We need to do our job.

Future Plans

How do you measure success for the Internet Archive? What are your plans? Do you have specific goals that you would like to achieve?

We want to grow our collections, but grow them in ways that they are useful to traditional library users—researchers, scholars, and the underserved. On the Web we can put tools and technology on top of collections—a search engine to answer harder questions. We can bring tools and information together in new ways that weren't possible before the Web. We think we have an archive; we want to build a digital library. We need partnerships to do this. We have collections but not a lot of finding aids. The top-level best thing for the community is universal access to human knowledge. It is within our grasp. We need to coordinate our efforts and just do it. I leap out of bed in the morning fueled by this possibility. We have the ability to touch people—like poor kids in Yemen. When the goal sinks in, every day is surprising. It straightens the road; it's not curvier. We can leave a legacy which will make our grandkids proud of us. This is not fanciful. We have technology for storing, for accessing. We have the political will to live in an open society. Libraries are now stepping up to the challenge.





Publishing Information

RLG DigiNews (ISSN 1093-5371) is a newsletter conceived by the members of the Research Libraries Group's PRESERV community. Funded in part by the Council on Library and Information Resources (CLIR) 1998-2000, it is available internationally via the RLG PRESERV Web site. It will be published six times in 2002. Materials contained in RLG DigiNews are subject to copyright and other proprietary rights. Permission is hereby given for the material in RLG DigiNews to be used for research purposes or private study. RLG asks that you observe the following conditions: Please cite the individual author and RLG DigiNews (please cite URL of the article) when using the material; please contact Jennifer Hartzell, RLG Corporate Communications, when citing RLG DigiNews.


Any use other than for research or private study of these materials requires prior written authorization from RLG, Inc. and/or the author of the article.


RLG DigiNews is produced for the Research Libraries Group, Inc. (RLG) by the staff of the Department of Preservation and Conservation, Cornell University Library. Co-Editors, Anne R. Kenney and Nancy Y. McGovern; Production Editor, Barbara Berger Eden; Associate Editor, Robin Dale (RLG); Technical Researchers, Richard Entlich and Peter Botticelli; Technical Coordinator, Carla DeMello; Technical Assistant, Kimberly Gazzo.


All links in this issue were confirmed accurate as of June 10, 2002.

Please send your comments and questions to preservation@cornell.edu.

   
 