
A Proposal to the Library of Congress
from Cornell University
Principal Investigator:
William Y. Arms
Department of Computer Science
wya@cs.cornell.edu
January 12, 2000
Abstract
This is a proposal to work with the Library of Congress to initiate a broad program to collect and preserve open-access materials from the World Wide Web. The effort will include consensus building within the Library, joint planning with external bodies, studies of the technical and policy issues, the development of a long-term plan and coordination of prototypes. The aim is to identify what can be done immediately and move rapidly through prototype into production in these areas. Meanwhile an agenda will be developed to encourage future research and development by the library and computing communities.
Electronic collections and the Library of Congress
A prime function of the Library of Congress is to collect the cultural and intellectual output of today for the benefit of future generations. An ever-increasing amount of this material is being created in digital formats and does not exist in any physical form. Such materials are colloquially described as "born digital". Current genres that will be important to future scholars include online newspapers, electronic journals, web sites for special events, and electronic correspondence. Many, such as scientific databases and interactive web sites, have no manifestation except the digital form. No realistic estimate of the volume of born digital materials has been made. They are clearly growing very fast.
The Library of Congress is planning a broad program to collect and preserve born digital materials. Since these materials are highly diverse, the program will have many facets. This proposal addresses one part of this program, the open-access materials on the World Wide Web. These materials are currently estimated to comprise about 450 million pages of information. They range from unique sources of high-quality information, through interesting ephemera, to complete junk. Collectively they represent one of the most fascinating cultural creations of this century. Without conscientious efforts by libraries and archives most of the material will soon be lost. Some have already been lost.
Collection development strategies
For born digital materials there is little distinction between acquisitions and preservation. From the moment that digital materials are acquired, continuing efforts are needed if the content is not to be lost through decay of media or technical obsolescence.
However, there is a choice to be made about access. In collecting web materials, the Library can choose to mount materials on web servers and provide access to them (restricting access to authorized users where necessary). As a lower-cost alternative, the Library can collect the raw materials and preserve them for future scholars, but not offer current access.
There are three main strategies for collecting and preserving open-access web material. The Library needs to pursue all of them.
Partnership with publishers
Many important web materials are carefully managed by an existing organization, usually a publisher. Open-access examples include electronic journals, such as the Journal of Electronic Publishing, and scholarly collections, such as the Legal Information Institute at the Cornell Law School. In these cases, the Library and the nation are best served by the publisher being responsible for preservation in the medium-term. The Library's innovative agreement with UMI is an example. UMI maintains a high quality archive of theses and dissertations that serves as the national archive of these materials.
Agreements in which the publisher maintains information have many advantages. As computing systems change, publishers can be expected to migrate the information to new media, formats, and operating systems. The Library needs to monitor the quality of the preservation, and be protected against difficulties such as corporate bankruptcy, but the resource commitment is much less than managing a duplicate copy of the material at the Library.
Preservation of actively managed materials by building relationships with publishers is beyond the scope of this effort.
Selective collection
A central part of the effort covered by this proposal is to begin the process of collecting and preserving selected open-access web sites. While there are many uncertainties, the area is sufficiently well understood that prototype collecting can begin almost immediately, leading rapidly to a mainstream activity of the Library.
The materials to select first are those where the creator or publisher makes no commitment to maintenance and preservation. For example, there are 5,000 online newspapers, few of which are preserved systematically. Some web sites are for special events, such as campaign web sites created by political candidates. Others are journals for which there are no paper versions, databases, and web sites of cultural or historical importance. Probably every curatorial division of the Library can identify important examples that should be preserved. Future generations may treasure IBM's 1996 Olympic Games web site with the affection that we now hold for films of Jesse Owens' triumphs in Berlin. The political web sites for the 2000 elections may prove as interesting to history as the films of John Kennedy's campaign speeches.
The Library will have to select the most important of these materials, acquire copies and manage them in its own repositories, or in cooperation with other libraries and archives. This requires processes for selection, acquisition and long-term preservation. New practices of librarianship will be introduced and existing practices adapted. Continuing access to these materials will require appropriate levels of cataloguing and indexing.
Bulk collection
No library has enough skilled professionals to select and collect individually more than a small proportion of the open-access web materials. Therefore, the second central part of this effort is to work with the Library and other partners to develop processes for preserving huge portions of the open-access web using automatic methods. It is not practical to offer current access to all these materials nor to provide normal catalogs or indexes, but basic indexing and preservation can be carried out by automatic methods.
Recently a number of organizations have demonstrated that huge amounts of material can be managed entirely by automatic means. While the quality of the selection and indexing that is possible automatically is poor, the costs are so low that it is possible to manage vast quantities of material and preserve them for future generations. The web search engines illustrate the current state of technology. The quality of their indexing cannot even be compared with professionally prepared catalogs and indexes, but they are able to provide crude access to huge amounts of information at very low cost.
The Internet Archive has demonstrated that it is technically and economically feasible to dump all the open-access web materials onto magnetic media. While it is impossible to preserve a functioning copy of the web, it is possible to preserve the essence of a vast number of web materials in a manner that future scholars can analyze and study. The archive has donated a snapshot of the web to the Library of Congress and hopes to establish a long-term relationship.
Description of project
Planning and consensus building
The highest priority is to develop a shared understanding within the Library of Congress of the importance of web materials and the techniques that can be used to collect and preserve them. Members of the Library, from senior management down, need to appreciate the opportunities for the Library and also the risks if no action is taken. Collecting and preserving born digital materials, including the open-access web, has implications for almost every division of the Library.
This effort will hold a series of meetings within the Library to develop a shared vision for collecting and preserving web materials. Visiting speakers will be invited to some of these meetings.
The Library can be expected to develop the librarianship skills to manage born digital materials and to play an international role in spreading the skills to other libraries. However, the Library is not a research organization and many of the computing methods for collecting and preserving web materials are still in a state of flux. In particular, specialized expertise is needed to manage very large collections by automatic means. The effort will hold at least one workshop at the Library that will bring together the leaders in this field to share expertise and build partnerships with the Library.
Prototypes and demonstration projects
The effort will act as advisor to the Library's staff in developing prototype collections of selected web sites. High priority should be given by the Library to beginning these prototype collections as soon as possible.
Prototypes serve several purposes. They establish the nucleus of long-term collections; they demonstrate what can be achieved; they build staff experience and expertise. The success of the National Digital Library Program would not have been possible without the experience of American Memory, which itself drew on expertise developed through earlier conversion projects.
Policies for collecting open-access materials
The Library needs guidelines that describe its policies for open-access web materials. This effort will work with the relevant experts at the Library to develop these guidelines.
Under the law of mandatory deposit, the Library of Congress can demand two copies of all works published in the United States; the law is interpreted as applying to born digital materials. In rare cases, it may be necessary for the Library to be assertive in demanding important materials for its collections, but this is undesirable. One reason for an early emphasis on open-access materials is that their creators have deliberately made them available to the public, but even for these there are subtle legal and policy considerations.
Statistics and cost projections
Collecting born digital materials will require substantial long-term investments. Within a few years, they could constitute a significant proportion of the Library's total budget. The Library needs good data to use in developing its policies, building its expertise and justifying its budgets. Therefore, this effort will establish baseline statistics for the size of the open-access web, the proportion of materials of interest, the rate of growth, and the costs of various strategies for collecting, access and preservation.
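As a rough illustration of how such baseline statistics might feed a cost projection, the following sketch compounds an assumed annual growth rate and multiplies the result by an assumed proportion of interest and unit cost. The function names and all parameter values are hypothetical placeholders; this proposal does not supply a growth rate, a proportion of interest, or a cost per page, and producing those figures is part of the work described here.

```python
from typing import List


def project_pages(current_pages: float, annual_growth: float, years: int) -> List[float]:
    """Return the projected page count for year 0 through `years`.

    `annual_growth` is a fraction (0.5 means 50% growth per year) and is an
    assumed input, not a measured figure.
    """
    return [current_pages * (1 + annual_growth) ** year for year in range(years + 1)]


def collection_cost(pages: float, fraction_of_interest: float, cost_per_page: float) -> float:
    """Estimate the cost of collecting the share of pages judged to be of interest."""
    return pages * fraction_of_interest * cost_per_page


if __name__ == "__main__":
    # 450 million pages is the estimate quoted in this proposal; the growth
    # rate, fraction of interest and unit cost below are placeholders only.
    for year, pages in enumerate(project_pages(450e6, annual_growth=0.5, years=3)):
        cost = collection_cost(pages, fraction_of_interest=0.01, cost_per_page=0.001)
        print(f"year {year}: {pages:,.0f} pages, estimated cost {cost:,.2f}")
```

Separating the growth model from the cost model lets each strategy (selective collection, bulk collection, access or no access) be compared by changing only the proportion and unit-cost inputs.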
Technical planning
Technically, collecting and preserving born digital materials appear to be forbidding tasks. Potentially, the Library could be expected to collect materials in any format, including programs that run on any system and databases structured in any bizarre fashion. Metadata provided by creators can follow any standard or none.
Fortunately, the situation is not as bad as it seems. Many of the most important materials are actively managed by some external organization, whose self-interest is served by preservation and continual migration to new technology. Other information is designed for delivery to the public and therefore follows standards that are compatible with widely distributed software, such as web browsers. Preservation of these materials is a significant task, but at least the technology that must be accommodated is well supported.
The principal approaches to preservation are refreshing bits by copying them to different media and migration of content from one format and computing system to another. A specialized form of migration is to emulate components of a computing system. The Library will need to use all these methods. Migration of content is essential for all collections where continuing access is provided. For bulk collections, full migration of content is prohibitively expensive and probably not feasible, but selective migration can be carried out automatically.
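Refreshing, the simplest of these approaches, can be sketched as a copy followed by a check that the bits arrived unchanged. The example below is a minimal illustration and not a description of any system at the Library: the function names, the file paths and the choice of SHA-256 as the checksum are all assumptions made for the sketch.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh_copy(source: Path, destination: Path) -> str:
    """Copy `source` to `destination` and confirm the copy is bit-identical.

    Returns the checksum so it can be recorded for later checks.
    Raises IOError if the copy does not match the original.
    """
    expected = sha256_of(source)
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, destination)
    if sha256_of(destination) != expected:
        raise IOError(f"checksum mismatch after copying to {destination}")
    return expected


if __name__ == "__main__":
    # Hypothetical directories standing in for old and new storage media.
    with tempfile.TemporaryDirectory() as tmp:
        old_copy = Path(tmp) / "old_media" / "page.html"
        new_copy = Path(tmp) / "new_media" / "page.html"
        old_copy.parent.mkdir(parents=True)
        old_copy.write_bytes(b"<html><body>archived page</body></html>")
        print(refresh_copy(old_copy, new_copy))
```

Refreshing of this kind preserves only the bits. It does nothing to keep the format readable, which is why migration and emulation are also needed.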
Migration will never be completely effective. Inevitably some important materials will pose severe technical problems. For them there is no single, simple answer. Preserving some may be subcontracted to specialists; some may be preserved in a downgraded form; some creators may be prepared to provide archival versions; and some will inevitably be lost.
A collection of formats and protocols
The web is based on a few simple formats and protocols, but there are numerous variations. Advanced web sites use specialized formats, mobile code and other techniques that are awkward for long-term preservation. One special collection that the Library should establish is a record of the technology of the web. Whatever technical methods are used for preservation, records of the specifications of formats, protocols and so on used by web sites will be required.
Plan
This proposal is for a 12-month effort, divided into two phases. The work will be carried out by William Arms, who will contribute one month's effort in each phase.
The exact results that can be achieved within this 12-month period depend upon the rate at which the Library develops its program during this period. Planning can take place without new resources, but prototypes and production systems require that librarians and computing professionals be assigned to this area. The following schedule is based on the assumption that the Library will be ready to begin limited prototypes within 6 months.
Part 1 (six months)
1. Regular meetings with staff at the Library of Congress to build consensus and establish priorities.
2. Select web sites for prototype collections, select technology, design prototype.
3. Meet with Internet Archive and host technical meeting at the Library of Congress to review tools for automatic management of bulk collections.
4. Gather current estimates, however rough, of the volume of open-access web materials that should be collected.
5. Work with the relevant experts at the Library to develop guidelines for collecting and preserving open-access web materials.
6. Provide an interim written report.
Part 2 (six months)
1. Continue internal meetings at the Library. Develop plans for collecting open-access web materials.
2. Advise the Library's staff on the implementation of prototype collections of web materials.
3. Design and coordinate initial prototyping of automatic management of bulk collections.
4. Prepare baseline cost estimates of the various options for collecting and preserving open-access web materials.
5. Complete initial policy guidelines.
6. Provide a final written report.
Coordination with the Library
Success of this effort depends upon close coordination with the Library of Congress. A member of the Library should be appointed as principal point of contact, preferably a senior librarian who will have continuing responsibility for the Library's programs in these areas.
wya
Last revised: March 7, 2000