Notes about collecting a Web site

Carl Fleischhauer
National Digital Library Program

April 21, 2000

Written in the middle (!) of the process, for my colleagues who may be curious. Any recipient should feel free to pass this along to interested parties.


We are collecting a "recorded sound" website as a part of the audio-visual repository prototyping project. In the initial phase of our project, we wanted to have a cross section of digital content that represents recorded sound collections:

Reformatted historical materials
     Deteriorating items, selected because of condition
          Example: Marine Corps Combat Recordings from WWII
     Items requested by researchers in Capitol Hill reading rooms, delivered from Culpeper
          Example: 12-inch long playing phonograph records
Born digital recorded sound
     Web site described below

An informal group consisting of myself and staff of the Motion Picture, Broadcasting, and Recorded Sound Division looked for a web site with some of the following characteristics:

After a brief canvass of possible sites, we struck upon one that met most of our criteria: http://gorillasalad.com/sbryant/gsp/. The site represents the work of Steven Bryant, a young Juilliard-trained composer with an interest in serious and (apparently) rock music.


We collected the site on April 17, 2000, when only five of eight compositions were available. (Some of the eight were present in multiple variants, so the menu lists fourteen files.) The remaining recordings appear to be on a now-unavailable Juilliard server. At first this seemed like a setback, but then we rationalized the outcome by saying, "Well, isn’t that just a good representation of what a web site is like?" On April 15, Bryant added a dated note to his recordings menu page reporting that he knew some files were unreachable and that he was working on the problem.

I began collecting the site by hand, opening each page and saving it to my hard disk. This slow process quickly became tedious. For example, many of the Gorilla Salad navigation "bars" are made up of separate bit-mapped typographic images, where even the vertical separator lines are images of their own. And the picture gallery has 62 image files, mostly JPEG pictures of the composer, of his studio, and of him at performances and with friends. Remembering the mention of web-collecting software in the talk by the National Library of Australia visitors, Dick Thaxter helped me find the shareware site TuCows.com, where we downloaded a French freeware package called HTTrack, which received the estimable "four cows" rating. Here’s the rest of the TuCows blurb about HTTrack:

Version Number: 2.01
Revision Date: March 22nd, 2000
Byte Size: 1.8 MB
License: Freeware
Evaluation Period: Unlimited
Home Page: http://httrack.free.fr
Note: Click here for win95/98 patch for version 1.20
Description: HTTrack is an easy-to-use offline browser utility. It allows you to download a World Wide website from the Internet to a local directory, building recursively all directories, getting html, images, and other files from the server to your computer. HTTrack arranges the original site's relative link-structure. Simply open a page of the "mirrored" website in your browser, and you can browse the site from link to link, as if you were viewing it online. HTTrack can also update an existing mirrored site, and resume interrupted downloads. HTTrack is fully configurable, and has an integrated help system.

Well, HTTrack worked like a miracle. It downloaded the whole site (save the recordings) in one minute or so. I was amazed! Fearing that it would be slow, I tried to use the software’s filter feature to avoid grabbing the 62 JPEG images, which I had previously downloaded one at a time. My settings failed and the JPEGs came across again, though very quickly. HTTrack’s default settings do NOT download any links that include "http" and, since Gorilla Salad listed every music link (even ones on the local server) as an "http," none of the music was captured. This was fine because I had also downloaded the MP3 and AIFF files by hand a few days earlier.

The net result is this (imagine a MARC 300 field): 49 HTML files; 5 PDF files; 1 MS-WORD file; 3 MP3 files; 2 AIFF files; 62 image files (JPEG and GIF).
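
A tally like the one above could be produced automatically from the mirrored directory. The following is a minimal Python sketch, not part of the original experiment; the function names and the directory name in the usage note are hypothetical.

```python
from collections import Counter
from pathlib import Path


def tally_by_extension(filenames):
    """Count file names by lowercase extension (without the leading dot)."""
    counts = Counter()
    for name in filenames:
        ext = Path(name).suffix.lower().lstrip(".")
        # Files with no extension are grouped under a placeholder label.
        counts[ext or "(none)"] += 1
    return counts


def tally_directory(root):
    """Walk a mirror directory and tally every regular file beneath it."""
    return tally_by_extension(
        p.name for p in Path(root).rglob("*") if p.is_file()
    )
```

Calling `tally_directory("gorillasalad_mirror")` (a hypothetical folder name) would return counts keyed by extension, such as `html`, `pdf`, `mp3`, and `jpg`, which could then be grouped by hand into the categories listed above.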


I found myself modifying two of the HTML files, keeping copies of the originals unchanged for preservation, natch. On the home page, I removed a few lines of code that invoked a site-visit counter service (at counter.com, or is it mycomputer.com?) and added a visible line at the bottom of the page: "Experimental preservation copy collected by the Library of Congress Audio-Visual Project, April 17, 2000." On the menu page for the music files, for the five files that I had succeeded in collecting, I changed all of the <a href> links from http pointers to links to the local directories that held the files. (The Gorilla Salad site could have done that; their links used an "unnecessary" http.) On the same menu page, for the files that I was not able to download, I removed the non-working http links to avoid 404 errors and added the display comment: "[Not available in LC preservation copy]."
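
The menu-page edits described above were made by hand. As an illustration only, here is a minimal Python sketch of the same two changes: pointing collected files at local paths and replacing dead links with a display note. The function name, the regular expression, and the example URLs are assumptions, and the pattern handles only simple anchors of the form `<a href="...">text</a>`.

```python
import re

# Display note taken from the wording used on the modified menu page.
NOTE = "[Not available in LC preservation copy]"

# Matches simple anchors only; anchors with extra attributes are left untouched.
ANCHOR = re.compile(r'<a\s+href="([^"]+)"\s*>(.*?)</a>', re.IGNORECASE | re.DOTALL)


def localize_links(html, collected, missing):
    """Rewrite anchors in html.

    collected: dict mapping an absolute URL to the local relative path of the
               file that was captured.
    missing:   set of absolute URLs whose files could not be captured.
    Any link in neither collection is returned unchanged.
    """
    def fix(match):
        url, text = match.group(1), match.group(2)
        if url in collected:
            return f'<a href="{collected[url]}">{text}</a>'
        if url in missing:
            # Drop the dead link but keep its label, followed by the note.
            return f"{text} {NOTE}"
        return match.group(0)

    return ANCHOR.sub(fix, html)
```

The unmodified original page should still be kept alongside the rewritten copy, as was done in the experiment.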

When capturing the pages, HTTrack adds this comment at the top:
<!-- Mirrored from gorillasalad.com by HTTrack/2.x [XR/YP'2000]-->
This seemed like rather a good thing and, on the two pages I modified, I expanded the comment in this way:
<!-- Mirrored from gorillasalad.com by HTTrack/2.x [XR/YP'2000] by M/B/RS-DLC April 17, 2000-->
<!-- Modified for preservation experiment by M/B/RS-DLC April 18, 2000-->
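
If many pages needed the same annotation, the comment expansion shown above could be scripted. This is a minimal Python sketch under the assumption that each page carries one HTTrack "Mirrored from" comment in the form quoted above; the function and parameter names are hypothetical.

```python
import re

# Captures the text of the first "Mirrored from ..." comment added by HTTrack.
MIRROR_COMMENT = re.compile(r"<!--\s*(Mirrored from .*?)\s*-->", re.DOTALL)


def annotate(html, collector, collected_on, modified_on):
    """Extend the mirror comment with collector and date, then add a
    second comment recording that the page was modified."""
    def expand(match):
        return (
            f"<!-- {match.group(1)} by {collector} {collected_on}-->\n"
            f"<!-- Modified for preservation experiment by {collector} {modified_on}-->"
        )

    # count=1 so that only the first matching comment is rewritten.
    return MIRROR_COMMENT.sub(expand, html, count=1)
```

For example, `annotate(page, "M/B/RS-DLC", "April 17, 2000", "April 18, 2000")` would reproduce the two comment lines shown above on a page that begins with the stock HTTrack comment.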

As we carried out this experiment, Nancy Seeger and Mary Bucknum of the recorded sound section of the Motion Picture, Broadcasting, and Recorded Sound Division and I talked about ideas for cataloging, consulting Ardie Bausenbach and planning to consult Dave Reser. Our tentative idea is that the bibliographic record ought to describe our frozen preservation copy and not the living online site. That is, we plan to keep the site as captured this year (we may recapture when the missing audio files reappear) and may recapture the site a few years hence, when the new frozen site would be cataloged anew. Nancy has started to sketch out a record for us to consider and discuss.

What are we to make of all this? Well, it sure has been a lot of work to collect and, as the Australians reported, there seems to be a fair amount of work afoot to get the site into the right state for our purposes, i.e., modifying some pages, adding comments, etc. And we have not yet tried to capture our repository metadata for all of the files. That will be a chore unless our local systems or TEAMS software can do a good job of automatic capture. I can imagine getting some kinds of data easily, but Bryant seems to have assigned titles (<TITLE></TITLE>) only from time to time, making it a little harder to auto-extract page-by-page descriptive or labeling metadata. And I am not quite clear how to get at the data about the page relationships and the relationships to images (<img src>) to be inserted into HTML pages, all of which may fall within the meaning of "linking" in TEAMS software.
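
As a rough illustration of the kind of automatic capture discussed above, the following Python sketch pulls a page's title (when there is one), its image references, and its outgoing links, using only the standard library's `html.parser` module. It is not how TEAMS or any local system works; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class PageSummary(HTMLParser):
    """Collect the <title> text, <img src> values, and <a href> values of a page."""

    def __init__(self):
        super().__init__()
        self.title = None
        self.images = []
        self.links = []
        self._in_title = False
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lowercase.
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
            text = "".join(self._title_parts).strip()
            # An empty title is treated the same as a missing one.
            self.title = text or None

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)


def summarize(html):
    """Return a dict describing one page's title, images, and links."""
    parser = PageSummary()
    parser.feed(html)
    parser.close()
    return {"title": parser.title, "images": parser.images, "links": parser.links}
```

Pages without a title come back with `title` set to `None`, which at least makes it easy to list the pages that would need a label supplied by hand.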

No matter. We are inspired by the late Oat Willie, the Austin, Texas, cartoon character who spoke the immortal words, "Onward, through the fog!" Stay tuned and send your war stories.