The Internet Archive Colloquium brought together 30 to 40 people with interests in the web as an object of study. They included members of the Internet Archive, search engine developers, theoretical computer scientists, researchers in information retrieval and multi-media, policy experts and digital library specialists. Almost all the top people in these fields were present and we had an enthralling two days. For example, the chief technical people from Google, Yahoo!, Alexa and Infoseek gave technical presentations and were remarkably frank about their systems.
See http://www.archive.org/ for the agenda and the list of attendees. This report highlights some of the most important topics from the perspective of the Library of Congress.
The starting point for the meeting was a description of the activities of the Internet Archive, given by Brewster Kahle. The archive has been collecting the open-access html pages on the web since 1996, approximately monthly. The current holdings are 11 terabytes on tape and 2.5 terabytes on disk. The data gathering is carried out by Alexa, which is a commercial company. Alexa donates its data to the archive when it is six months old. They are currently setting up all the data on disk for researchers.
Several other organizations sweep the web regularly. For example, Google caches all pages once per month. Sadly, because of copyright concerns, this data is discarded at the end of the month. The logs of usage maintained by Yahoo! and others are also valuable data collections, but access is strictly controlled for reasons of privacy (and competition).
A standard technical environment has emerged for large-scale Internet processing and research. Everybody uses commodity PC hardware with Linux. Perl is the standard programming language for all activities except the performance-critical parts of the systems.
Larry Page of Google gave the following data. Google has 85 people, of whom half are technical and 14 have a Ph.D. in computer science. The central system handles 5.5 million searches daily, increasing 20% per month. They have 2,500 PCs running Linux, with 80 terabytes of spinning disk, and install an average of 30 new machines per day. The cache holds about 200 million html pages. The aim is to crawl the web once per month.
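To put the growth figure in perspective, the short Python sketch below compounds the reported 5.5 million daily searches at 20% per month. It assumes, purely for illustration, that the growth rate stays constant; the function name and the choice of time horizons are not from the presentation.

```python
def project_daily_searches(current, monthly_growth, months):
    """Compound a daily search volume forward by a fixed monthly growth rate.

    Assumption (not from the talk): the growth rate stays constant.
    """
    return current * (1.0 + monthly_growth) ** months


# Figures reported at the colloquium: 5.5 million searches/day, +20% per month.
reported_daily = 5.5e6
reported_growth = 0.20

for horizon in (0, 6, 12):
    projected = project_daily_searches(reported_daily, reported_growth, horizon)
    print(f"after {horizon:2d} months: {projected / 1e6:.1f} million searches/day")
```

Under that assumption the daily volume would reach about 49 million searches after twelve months, roughly a ninefold increase.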
Udi Manber (chief scientist of Yahoo!) mentioned that Yahoo! has 100 million registered users and dispatches half a billion pages to users per day.
A group of theoretical computer scientists, including Jon Kleinberg from Cornell, reported on some remarkable mathematical results from analyzing web links as a giant graph of nodes and links. This is not just interesting mathematics. It is behind the remarkable quality of Google and suggests methods to develop superior search engines.
These results also provide insight into the balance between searching and browsing. The web portal engineers are fascinated by how accurately the mathematical results fit the data that they have gathered about users' search behavior.
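As an illustration of what treating the web as a graph of nodes and links can mean in practice, here is a minimal Python sketch of a PageRank-style power iteration, one well-known form of link analysis. It is not necessarily the method any speaker presented; the toy graph, the damping value of 0.85 and the function name are assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank pages of a small directed graph by PageRank-style power iteration.

    links maps each page to the list of pages it links to.
    The damping value and iteration count are illustrative assumptions.
    """
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # A page passes its rank on equally to the pages it links to.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: three pages link to "c", nothing links to "d".
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph the page with the most incoming links, "c", ends up ranked highest, and "d", which no page links to, ends up lowest. The ranking is derived from link structure alone, without looking at page content.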
A panel of Jim Pitkow (Xerox PARC), Lee Giles (NEC), Pam Samuelson (UC Berkeley) and Geoff Nunberg (Xerox PARC) discussed the legal and policy issues of collecting open-access digital materials, including broadcast television. There are considerable doubts about privacy, copyright and related questions. As a result, important information is being discarded and access for researchers constrained.
Sam Gustman, who heads the Survivors of the Shoah archive, discussed an interesting example of these problems. This archive has recorded 50,465 interviews with survivors of Nazism (Jews, gypsies and homosexuals). Technically this is a huge archive. The interviews are MPEG-1 video at 3 Mbit/s. The test system, with its private network to 5 locations (museums) around the world, has access to a 180-terabyte tape archive with disk caches. Because of the sensitivity of the material, the aim is to discourage the extraction and use of sound bites out of context. Internet 2 may prove useful because rules and policies are enforceable within a closed community.
Several speakers mentioned the special legal position of the Library of Congress. Deanna Marcum (CLIR) discussed the concept of "certified" archives. These archives would collect materials for the long term, following federal regulations and receiving legal protection, perhaps as agents of the Library of Congress.
Other sessions described recent advances in technical areas (compression, information retrieval, automatic indexing, natural language processing). Richard Wright (BBC) gave an interesting account of the BBC Archives. There was a session on rich media (e.g., video) and Steve Lawrence (NEC) talked about the state of the art in "automatic librarianship", on the theme that "the best is the enemy of the good."
Overall, these presentations emphasized how much progress has been made during the past few years in collecting, organizing and providing access to vast collections of digital materials through automated means.
The Internet Archive and its collaborators are a remarkable resource for scholarship. The technical expertise and ability to move quickly are superb. At the same time, they would gain immensely from partnership with the Library of Congress and other major cultural organizations. Here are several recommendations:
William Y. Arms
March 11, 2000