Oral History Repository

This section describes a proposal for a new process for collecting, processing, and presenting the LHC oral history collection.

Oral History Recording
Oral histories should be recorded straight to digital using a low-cost digital recorder. The recording is exceptionally clean (no tape hiss or low-frequency hum), there is no generation loss because no tape has to be digitized, and the interview uploads to the PC quickly. The interview is 'born digital' and stays digital.

Post-Process Recording
After an interview is conducted, the recording file is uploaded to the PC, which should take only a few minutes for an hour-long interview. From then on, it can be copied as many times as needed with zero loss of quality.
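
The 'zero loss' property of a digital copy can be checked mechanically by comparing checksums of the source and the copy. The following is a minimal sketch, not part of the proposed toolchain; the function names and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in 1 MB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_verified(source: Path, destination: Path) -> str:
    """Copy source to destination and confirm the copy is bit-for-bit identical.

    Returns the shared checksum, which could be kept with the recording's record.
    """
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source, destination)
    original, duplicate = sha256_of(source), sha256_of(destination)
    if original != duplicate:
        raise IOError(f"copy of {source} does not match the original")
    return duplicate
```

Storing the returned checksum alongside a recording would also make it possible to confirm later that a file has not changed.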

Once it's on the PC, the original file is stored in an 'untouchable archive' and a working copy is made and loaded into Audacity (open-source audio editor) for initial processing, which could include:

  • deleting cruft at the beginning and ending of the recording
  • deleting mistakes during the recording (keeping integrity in mind)
  • normalizing for volume
  • equalization to produce best sound.

These changes should be very simple, and (other than the deleting) could be scripted so they are made almost automatically.
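
As an illustration of how the volume step might be scripted outside Audacity, here is a minimal sketch of peak normalization using only the Python standard library. It assumes the working copy is a 16-bit PCM WAV file; the function names and the 0.9 target level are hypothetical choices, not settings taken from this proposal.

```python
import array
import sys
import wave


def normalize_peak(samples, target_peak=0.9, full_scale=32767):
    """Scale integer samples so the loudest one sits at target_peak of full scale."""
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return list(samples)  # silence or empty input: nothing to scale
    gain = target_peak * full_scale / peak
    return [max(-full_scale, min(full_scale, round(s * gain))) for s in samples]


def normalize_wav(in_path, out_path, target_peak=0.9):
    """Peak-normalize a 16-bit PCM WAV file (assumed format) into out_path."""
    with wave.open(in_path, "rb") as reader:
        params = reader.getparams()
        if params.sampwidth != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        samples = array.array("h", reader.readframes(params.nframes))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    normalized = array.array("h", normalize_peak(samples, target_peak))
    if sys.byteorder == "big":
        normalized.byteswap()
    with wave.open(out_path, "wb") as writer:
        writer.setparams(params)
        writer.writeframes(normalized.tobytes())
```

The input file is never modified; the result is always written to a separate path, which keeps the original safe.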

Once a good working copy is saved, it is copied to the web by creating a new content item in Drupal and entering the appropriate metadata (e.g., Dublin Core items: title, number, description, author, etc.). For existing recordings, most of this is on the present card, which is now in PastPerfect. Once uploaded, the recording is accessible to anyone and is part of the repository. (Note: it is possible to restrict access until it is officially 'released' to the public.)
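
To make the metadata step concrete, here is a small sketch of how the card fields named above might be translated into Dublin Core element names before entry. The mapping is an assumption for illustration ('number' to 'identifier', 'author' to 'creator'); the actual field names in Drupal and PastPerfect may differ.

```python
# Hypothetical mapping from the card fields named in the text to
# Dublin Core element names. Adjust to match the real card layout.
CARD_TO_DUBLIN_CORE = {
    "title": "title",
    "number": "identifier",
    "description": "description",
    "author": "creator",
}


def to_dublin_core(card):
    """Translate a card dictionary into a dictionary keyed by Dublin Core elements.

    Raises ValueError if any mapped field is absent or empty, so an
    incomplete record is caught before it is entered.
    """
    missing = [field for field in CARD_TO_DUBLIN_CORE
               if card.get(field) in (None, "")]
    if missing:
        raise ValueError("card is missing: " + ", ".join(missing))
    return {CARD_TO_DUBLIN_CORE[field]: str(card[field]).strip()
            for field in CARD_TO_DUBLIN_CORE}
```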

Total time to upload the recording, process the audio, and create the content record: about 10-15 minutes, most of it spent entering the metadata.

Create Transcript
Once the recording is on the web, the link is emailed to a transcriber, who could be working either in the office or at home. The transcriber plays the audio in their favorite audio player. At the office, we'll use xmms, a popular open-source player that works on Linux. It has a very clear time display and easy forward/backward controls, operated by dragging a slider or pressing the arrow keys. It can be paused as needed to transcribe. There are also players that can slow down playback without changing pitch, if that is needed.

As the audio is being played, the transcription is typed into the transcriber's word processor of choice. At the office this could be OpenOffice, or it could be a plain text editor - it doesn't matter. Formatting also doesn't matter, since it will be formatted in the content system, not on the transcriber's PC.

The time-stamp of the audio is noted as the transcription is being made, and put at the top of paragraphs.
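
A consistent time-stamp format makes the transcript easier to search and to hand over between transcribers. The sketch below assumes a minute:second style (M:SS, or H:MM:SS for recordings over an hour); the exact format is an assumption, and the helper names are hypothetical.

```python
def format_timestamp(seconds):
    """Format elapsed seconds as M:SS, or H:MM:SS at one hour or more."""
    seconds = int(seconds)
    if seconds < 0:
        raise ValueError("elapsed time cannot be negative")
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes}:{secs:02d}"


def parse_timestamp(text):
    """Parse an M:SS or H:MM:SS time-stamp back into elapsed seconds."""
    parts = [int(part) for part in text.strip().split(":")]
    if not 2 <= len(parts) <= 3:
        raise ValueError(f"unrecognized time-stamp: {text!r}")
    total = 0
    for part in parts:
        total = total * 60 + part
    return total
```

For example, `format_timestamp(754)` gives `12:34`, and `parse_timestamp("1:02:05")` gives `3725`.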

Once the transcription is done, it is copied into the 'transcription' field in the content record on the web, where it is now available to anyone with access rights to that field.

Create Line Index
A summary of the transcript is made by reading the transcript and extracting the highlights along with their time-stamps. This line index should be only the essence, since the full transcript will be easy enough to skim or search. The line index is created in the word processor of choice, and is entered into the 'line-index' field in the content record.

At this point, the repository record is complete.

Existing Oral Recordings
For the 600 existing recordings, only the first step of this process changes. The existing original tape is digitized by playing it on the Yamaha deck into the soundcard on the PC, using Audacity as the recording software. Once this is done, the process is the same as above. Where transcripts and linecards already exist, they can be copied directly into the content record.

Summary of Steps
Here's a brief summary of the process:

  1. Record straight to digital recorder
  2. Upload to PC, process audio
  3. Create content record on website, upload audio file
  4. Add transcript directly to website
  5. Extract line index from transcript

Thoughts and Issues

I have put the creation of the line index after the transcript for two reasons: First, it will be so much quicker to create the line index visually from the transcript rather than playing the recording. Second, with the transcript being electronic and available from the web, it seems like the line index is of more limited use. One could easily scan the transcript, or use a search string. Therefore, the line index could be much less detailed - only hitting the real highlights.
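
To show how a search string could stand in for a detailed line index, here is a minimal sketch that finds the time-stamps of paragraphs mentioning a phrase. It assumes a transcript layout in which each time-stamp sits on its own line at the top of a paragraph and paragraphs are separated by blank lines; that layout and the function names are assumptions, not a format defined in this proposal.

```python
import re

# Matches a line holding only a time-stamp such as 4:15 or 1:02:30.
TIMESTAMP_LINE = re.compile(r"^\d+:\d{2}(?::\d{2})?$")


def split_transcript(text):
    """Return (time-stamp, paragraph) pairs from a time-stamped transcript.

    A paragraph without its own time-stamp is appended to the previous
    stamped paragraph; text before the first time-stamp is ignored.
    """
    segments = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [line.strip() for line in block.strip().splitlines()]
        if not lines:
            continue
        if TIMESTAMP_LINE.match(lines[0]):
            segments.append((lines[0], " ".join(lines[1:])))
        elif segments:
            stamp, body = segments[-1]
            segments[-1] = (stamp, (body + " " + " ".join(lines)).strip())
    return segments


def find_in_transcript(text, phrase):
    """Return time-stamps of paragraphs containing phrase (case-insensitive)."""
    needle = phrase.lower()
    return [stamp for stamp, body in split_transcript(text)
            if needle in body.lower()]
```

The returned time-stamps tell a listener where to move the player's slider to hear the passage itself.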

With this web-based approach, it would be very easy to break up the transcription job. If a transcriber can't finish, they just note the time-stamp where they stopped. Using an actual minute:second time rather than the arbitrary counter on the recorder means anyone can pick up where they left off, and since their transcription work is on the web it's easy to add to it. It's also easy to edit or change it as needed (with appropriate permissions, of course).

The transcription job could be delegated anywhere in the world very easily (e.g., India).

Once the recordings are on the web, it is a simple matter to make them available as podcasts for playing on portable MP3 players, or to clip excerpts for use in special 'exhibits' on the web. For example, a special web page on Eartha could automatically start playing excerpts of her audio when someone goes to the page.
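
A podcast is essentially an RSS 2.0 feed whose items carry an 'enclosure' element pointing at the audio file. The sketch below builds such a feed with the Python standard library. The function name, the dictionary keys, and any URLs passed in are hypothetical, and the content system may well be able to produce a feed on its own.

```python
import xml.etree.ElementTree as ET


def build_podcast_feed(channel_title, channel_link, channel_description, episodes):
    """Build a minimal RSS 2.0 feed as a string.

    Each episode is a dict with 'title', 'url', and 'bytes' (file size),
    which are illustrative key names chosen for this sketch.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = channel_title
    ET.SubElement(channel, "link").text = channel_link
    ET.SubElement(channel, "description").text = channel_description
    for episode in episodes:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = episode["title"]
        # RSS 2.0 requires url, length (in bytes), and type on an enclosure.
        ET.SubElement(item, "enclosure",
                      url=episode["url"],
                      length=str(episode["bytes"]),
                      type="audio/mpeg")
    return ET.tostring(rss, encoding="unicode")
```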

We should be able to electronically import the 600 descriptive cards that have already been entered into PastPerfect. Then, we can add the digitized audio to these recordings as we make them.

Digitizing the existing recordings is really a background task. You insert a tape, press record in the software and play on the deck, and come back in an hour. For this reason, it might be useful to set up a recording system outside Joan's office where others could run the process through the day as a background activity.
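
Because tapes play back in real time, the size of this background task can be estimated with simple arithmetic. The sketch below assumes about one hour per tape (as the paragraph above implies) and a hypothetical pace of four tapes per working day; both figures are assumptions to be replaced with real numbers.

```python
import math


def digitizing_estimate(tape_count, hours_per_tape, tapes_per_day):
    """Return (total deck hours, working days) for digitizing a collection."""
    deck_hours = tape_count * hours_per_tape
    working_days = math.ceil(tape_count / tapes_per_day)
    return deck_hours, working_days


# 600 existing recordings, assumed one hour each, assumed four tapes a day.
deck_hours, working_days = digitizing_estimate(600, 1.0, 4)
print(f"{deck_hours:.0f} deck hours, about {working_days} working days")
# prints "600 deck hours, about 150 working days"
```

Little of that time is hands-on, since each tape only needs attention at the start and the end.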

Of course, it's also possible to outsource this, but it would be very expensive and probably not the best use of LHC money.