Trip report

Internet Archive Colloquium

San Francisco
March 8 and 9, 2000

The Internet Archive Colloquium brought together 30 to 40 people with interests in the web as an object of study. They included members of the Internet Archive, search engine developers, theoretical computer scientists, researchers in information retrieval and multi-media, policy experts and digital library specialists. Almost all the top people in these fields were present and we had an enthralling two days. For example, the chief technical people from Google, Yahoo!, Alexa and Infoseek gave technical presentations and were remarkably frank about their systems.

See http://www.archive.org/ for the agenda and the list of attendees. This report highlights some of the most important topics from the perspective of the Library of Congress.

Data gathering

The meeting was built around the activities of the Internet Archive, and Brewster Kahle described the current situation. The archive has been collecting the open-access HTML pages on the web since 1996, approximately monthly. The current holdings are 11 terabytes on tape and 2.5 terabytes on disk. The data gathering is carried out by Alexa, which is a commercial company. Alexa donates its data to the archive when it is six months old. They are currently setting up all the data on disk for researchers.

Several other organizations sweep the web regularly. For example, Google caches all pages once per month. Sadly, because of copyright concerns, this data is discarded at the end of the month. The logs of usage maintained by Yahoo! and others are also valuable data collections, but access is strictly controlled for reasons of privacy (and competition).

Technical

A standard technical environment has emerged for large-scale Internet processing and research. Everybody uses commodity PC hardware with Linux. Perl is the standard programming language for all activities except the performance-critical parts of the systems.

Larry Page of Google gave the following data. Google has 85 people, of whom half are technical and 14 have a Ph.D. in computer science. The central system handles 5.5 million searches daily, increasing 20% per month. They have 2,500 PCs running Linux, with 80 terabytes of spinning disk, and install an average of 30 new machines per day. The cache holds about 200 million HTML pages. The aim is to crawl the web once per month.
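
To give a sense of what 20% monthly growth implies, the short Python sketch below compounds the quoted search volume forward. It assumes the growth rate stays constant, which the presentation did not claim. The function and variable names are illustrative only.

```python
import math

# Figures quoted in the report.
searches_per_day = 5.5e6
monthly_growth = 0.20

def project(start, rate, months):
    """Compound a starting value at a fixed monthly rate (an assumption)."""
    return start * (1.0 + rate) ** months

# Time for the load to double at a constant 20% per month.
doubling_months = math.log(2) / math.log(1.0 + monthly_growth)

# Projected daily searches one year on, if the rate held.
after_year = project(searches_per_day, monthly_growth, 12)

print(f"doubling time: {doubling_months:.1f} months")
print(f"after 12 months: {after_year / 1e6:.0f} million searches per day")
```

Under that assumption the load doubles in a little under four months and grows roughly ninefold in a year, to about 49 million searches per day. This helps explain the rate at which new machines are being installed.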

Udi Manber (chief scientist of Yahoo!) mentioned that Yahoo! has 100 million registered users and dispatches half a billion pages to users per day.

Graph theory research

A group of theoretical computer scientists, including Jon Kleinberg from Cornell, reported on some remarkable mathematical results from analyzing web links as a giant graph of nodes and links. This is not just interesting mathematics. It is behind the remarkable quality of Google and suggests methods to develop superior search engines.
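
As an illustration of treating the web as a graph, the Python sketch below scores pages by repeatedly passing each page's score along its outgoing links, in the style of published link-analysis methods such as PageRank. This report does not record which algorithms or results were presented, so the sketch should be read as a generic example. The function name, the damping value of 0.85, and the toy graph are all assumptions.

```python
def rank_pages(links, damping=0.85, iterations=50):
    """Score pages by power iteration over a link graph (PageRank-style).

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start with every page equally important.
    score = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Pages with no outgoing links spread their score over all pages.
        dangling = sum(score[p] for p in pages if not links.get(p))
        new = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        # Every other page divides its score among the pages it links to.
        for page, targets in links.items():
            if targets:
                share = damping * score[page] / len(targets)
                for target in targets:
                    new[target] += share
        score = new
    return score

# Hypothetical four-page web: three pages link to "c".
toy_web = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
scores = rank_pages(toy_web)
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

In this toy graph, page "c" receives the highest score because three pages link to it, and page "d", which nothing links to, receives the lowest. The scores depend only on link structure and not on the text of the pages.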

These results also provide insight into the balance between searching and browsing. The web portal engineers are fascinated by how accurately the mathematical results fit the data that they have gathered about users' search behavior.

Policy and legal

A panel of Jim Pitkow (Xerox PARC), Lee Giles (NEC), Pam Samuelson (UC Berkeley) and Geoff Nunberg (Xerox PARC) discussed the legal and policy issues of collecting open-access digital materials, including broadcast television. There are considerable doubts about privacy, copyright and related questions. As a result, important information is being discarded and access for researchers constrained.

Sam Gustman, who heads the Survivors of the Shoah archive, discussed an interesting example of these problems. This archive has recorded 50,465 interviews with survivors of Nazism (Jews, gypsies and homosexuals). Technically this is a huge archive. The interviews are MPEG-1 video at 3 Mbits/second. The test system, with its private network to 5 locations (museums) around the world, has access to a 180-terabyte tape archive with disk caches. Because of the sensitivity of the material, the aim is to discourage the extraction and use of sound bites out of context. Internet 2 may prove useful because rules and policies are enforceable within a closed community.

Several speakers mentioned the special legal position of the Library of Congress. Deanna Marcum (CLIR) discussed the concept of "certified" archives. These archives would collect materials for the long term, following federal regulations and receiving legal protection, perhaps as agents of the Library of Congress.

Other presentations

Other sessions described recent advances in technical areas (compression, information retrieval, automatic indexing, natural language processing). Richard Wright (BBC) gave an interesting account of the BBC Archives. There was a session on rich media (e.g., video) and Steve Lawrence (NEC) talked about the state of the art in "automatic librarianship", on the theme that "the best is the enemy of the good."

Overall, these presentations emphasized how much progress has been made during the past few years in collecting, organizing and providing access to vast collections of digital materials through automated means.

Implications for the Library of Congress

The Internet Archive and its collaborators are a remarkable resource for scholarship. The technical expertise and ability to move quickly are superb. At the same time, they would gain immensely from partnership with the Library of Congress and other major cultural organizations. Here are several recommendations:

  1. Scholarly research and education. The archives are a resource for scholarly research in almost every aspect of humanities and social sciences. The academic community as a whole needs an outreach program to help scholars understand the resources that are available and to develop tools for analyzing them.
  2. Certification of archives. The Library should develop a legal and regulatory framework that enables archives (such as the Internet Archive) to be its partners in preserving digital information for future generations.
  3. Selective collection. This colloquium concentrated on bulk collection of huge amounts of information with very basic organization. A parallel activity is needed in which selected categories of web information are collected more carefully and organized for scholarship.
  4. Technical. The Library of Congress should recognize that it lacks technical strengths in these areas and must rely on partnerships with other organizations. For selective collection and scholarly research the Library will need to support the standard environment (commodity PCs plus Linux and Perl).

William Y. Arms
March 11, 2000