Information • Entertainment • Opinion (Since 1985)
The Great Library of Alexandria was the largest and probably the most important library of the ancient world. Its mandate to gather all of the world’s knowledge in one place was carried out through a vigorous acquisition program that included extensive book-buying trips around the Mediterranean. Prominent destinations for the Library’s curators were the well-known book fairs of Rhodes and Athens. In addition, Egyptian officials were not shy about confiscating books from every ship arriving in port, keeping the originals and giving copies back to the owners. Over its lifetime the Library experienced several major disasters and was completely destroyed by fire in 642 AD, erasing centuries of recorded knowledge and history.
Today, similar efforts are underway to assemble and maintain a Modern Great Library, and serious thought is being put into how to prevent it from ultimately suffering a fate similar to that of the Library of Alexandria. Google has estimated there are about 130 million unique books in the world, and indicates it intends to digitize all of them by 2020. Most of the scanned books are no longer in print or commercially available.¹
The process of digitizing a book is a laborious, multi-step affair. First, a camera takes a picture of a page. Next, the words on the page are recognized as sequences of ASCII characters rather than simply as images. Finally, the content is stored in a compact yet efficiently searchable digital format. The conversion to ASCII is itself performed in several stages. The first stage uses Optical Character Recognition (OCR) software, which, sophisticated as it is, still cannot correctly translate every word on a page. The next stage therefore relies on humans to decipher text that the OCR software finds untranslatable or unreadable. What most internet surfers do not realize is that they are unwitting participants in this stage of the project.
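The hand-off from OCR to human readers described above can be pictured with a small sketch. It is only an illustration under stated assumptions: the `OcrWord` type, the per-word confidence score, and the 0.9 threshold are all hypothetical, and nothing here is meant to describe how Google's actual pipeline decides which words to send to people.

```python
from dataclasses import dataclass


@dataclass
class OcrWord:
    text: str          # what the OCR software thinks the word says
    confidence: float  # hypothetical score between 0.0 and 1.0


def route_words(words, threshold=0.9):
    """Split OCR output into words accepted as-is and words needing a human.

    The threshold is an assumed tuning knob, not a published value.
    """
    accepted, needs_human = [], []
    for word in words:
        if word.confidence >= threshold:
            accepted.append(word)
        else:
            needs_human.append(word)
    return accepted, needs_human


# A made-up scanned line with one badly recognized word.
page = [OcrWord("Great", 0.99), OcrWord("Libr4ry", 0.41), OcrWord("of", 0.98)]
accepted, needs_human = route_words(page)
print([w.text for w in needs_human])  # the words a person would be asked to read
```

In this toy example only the garbled word is passed along for human transcription, while the confidently recognized words go straight into the digital text.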
In 2006, a team of researchers from Carnegie Mellon developed a software tool called reCaptcha. Captcha stands for “completely automated public Turing test to tell computers and humans apart” and is a process familiar to anyone who surfs the internet. Before gaining access to a secure area of the World Wide Web, a surfer is sometimes presented with hazy, distorted letters and asked to transcribe them. These “Captchas” are supposed to be readable only by humans, ensuring that computer robots do not hack secure websites. The genius of the Carnegie Mellon team was to harness this uniquely human effort and press it into the service of assisting OCR – and this assistance is not insignificant. Around the world, approximately 200 million Captchas are solved every day and, happily, this roughly equals the number of words that OCR fails to read among the one billion words Google digitizes daily in its book scanning project!
Besides the Google book project, there are other major efforts underway to create and maintain a permanent Modern Great Library. One organization looking at the problem is the Long Now Foundation. According to their website, “The Long Now Foundation was officially established in 1996 to develop the 10,000-Year Clock and 10,000-Year Library projects as well as to become the seed of a very long term cultural institution.” The Long Now Foundation and Stanford University Libraries held a conference to work specifically on the establishment of the 10,000-Year Library. Among the attendees were Michael Keller (Stanford University librarian), Elisabeth Niggemann (head of the German National Library), cuneiform expert William Hallo of Yale University, and Brewster Kahle (creator of the Internet Archive). Conference participants discussed the need to ensure the permanence of information and the importance of long-term planning. In the words of Keller:
Stewardship of cultural content is the essential role of research libraries. Serious players in this field have always collected, organized, and preserved information – OK, books, mostly – on behalf of future generations, but up to now, we haven't really thought seriously about how many such generations, or how to think about the mission in terms of thousands of years. Digital information technologies, with their notorious instability, force us to reassess how we go about fulfilling this mission hereafter. So we are an interested party. But nobody knows what the important questions are, to say nothing of solutions.
The Long Now Foundation has developed the Rosetta Disk to assist historians of the distant future in deciphering the current written record. The Rosetta Disk is a 2-inch-diameter nickel disk microscopically etched with analog text at a scale that can be read with powerful microscopes. Approximately 350,000 pages can be stored on a single disk, and the disks have life expectancies of up to 10,000 years. Assuming Google’s estimate of 130 million books is accurate and using a figure of 200 pages per book, approximately 75,000 Rosetta Disks would be required to preserve and store all of the books currently in existence. These Rosetta Disks remind me of the ancient Sumerian clay tablets found in places like Kish and Ebla (which are still readable!). Although the invention and use of papyrus eventually displaced the clay tablet, there are certainly archival advantages to more robust media like clay.
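The disk count quoted above is easy to check. The short calculation below uses only the three figures given in the text (130 million books, 200 pages per book, 350,000 pages per disk); the variable names are mine, and the 200-page figure is the essay's working assumption rather than a measured average.

```python
import math

BOOKS = 130_000_000       # Google's estimate of unique books, as quoted above
PAGES_PER_BOOK = 200      # the working assumption used in the text
PAGES_PER_DISK = 350_000  # approximate capacity quoted for one Rosetta Disk

total_pages = BOOKS * PAGES_PER_BOOK
# Round up: a partly filled disk is still a whole disk.
disks_needed = math.ceil(total_pages / PAGES_PER_DISK)

print(f"{total_pages:,} pages -> {disks_needed:,} disks")
# prints "26,000,000,000 pages -> 74,286 disks"
```

The result, a little over 74,000 disks, agrees with the round figure of 75,000 given above.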
A two-pronged archival strategy has been proposed by Doug Carlson of Broderbund Software. Any great archive should have two versions of everything, one featuring fast record/read capability and the other slow copying/retrieval. The fast version would be similar to the current World Wide Web, used day to day, and somewhat ephemeral. The slow version would be more like the Rosetta Disks, difficult to create and gain access to, but also very difficult to destroy. He proposes a continual process of migration from fast to slow versions. Periodically, the fast version would be compared to the definitive slow archived version, and corrected if necessary. In fact, this process is similar to life on Earth: long-lived information is encoded in the sturdy DNA while the ephemeral RNA is used day-to-day, periodically corrected and recreated by comparing it back to the DNA.
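The migrate-and-compare cycle described above can be sketched in a few lines. This is a toy model under my own assumptions, not anything proposed by Carlson or the Long Now Foundation: the `TwoTierArchive` class, its method names, and the use of SHA-256 digests for comparison are all hypothetical, and two in-memory dictionaries stand in for the fast and slow media.

```python
import hashlib


def digest(text: str) -> str:
    """Fingerprint a record so two copies can be compared cheaply."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class TwoTierArchive:
    """Toy model: a fast, ephemeral copy and a slow, definitive copy."""

    def __init__(self):
        self.fast = {}  # day-to-day copy: easy to write, easy to damage
        self.slow = {}  # archival copy: written once, treated as the truth

    def write(self, key, text):
        """New material always lands in the fast tier first."""
        self.fast[key] = text

    def migrate(self):
        """Copy records not yet archived into the slow tier; return their keys."""
        moved = [key for key in self.fast if key not in self.slow]
        for key in moved:
            self.slow[key] = self.fast[key]
        return moved

    def repair(self):
        """Restore fast-tier records that are missing or differ from the archive."""
        fixed = []
        for key, good in self.slow.items():
            current = self.fast.get(key)
            if current is None or digest(current) != digest(good):
                self.fast[key] = good
                fixed.append(key)
        return fixed


archive = TwoTierArchive()
archive.write("iliad", "Sing, O goddess, the anger of Achilles")
archive.migrate()                          # fast -> slow
archive.fast["iliad"] = "Sing, O g0ddess"  # simulate decay of the fast copy
repaired = archive.repair()                # slow -> fast correction
```

One design choice worth noting: `migrate` never overwrites a record that already exists in the slow tier, so a damaged fast copy cannot pollute the definitive one. That mirrors the DNA and RNA analogy above, where correction flows only from the sturdy copy to the ephemeral one.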
If the best solution is to combine the long life of a robust (probably analog) medium with the low cost and speed of digital storage and retrieval, we must carefully answer the following question: which physical media are permanent and which are ephemeral? For instance, as Daniel Hillis, supercomputer designer and co-chair of the Long Now Foundation, puts it, “Is the Net…profoundly robust and immortal, or is it the most ephemeral digital artifact of all?” [i.e. 'cloud computing'] Hillis has talked about the possibility of a future “Digital Dark Age” due to a lack of digital permanence. And according to Stewart Brand, also of the Long Now Foundation, “Vast archives of digitized NASA satellite imagery of the Earth in the 1960s and 1970s – priceless to scientists studying change over time – now reside in obsolete, unreadable formats on magnetic tape.”
Preservation strategies must also reflect the magnitude of what we deem historically significant. What are we recording today that future generations will want to know? Should we preserve only our finished literary works? What about all our twittering minutiae? Future use of our Modern Great Library may resemble searching for quantum needles in a holographic haystack, and search strategies might well include the use of quantum computing algorithms. Maybe one day this store of information will actually supplant our physical world à la Wheeler’s “it from bit.” Who knows?
I think we will be pondering questions like these for a long, long time.
¹ Kelly, Kevin. “Scan This Book!” New York Times Magazine, May 14, 2006: “When Google announced in December 2004 that it would digitally scan the books of five major research libraries to make their contents searchable, the promise of a universal library was resurrected. ... From the days of Sumerian clay tablets till now, humans have ‘published’ at least 32 million (sic) books [note: depending on how one defines book, this figure seems much too low], 750 million articles and essays, 25 million songs, 500 million images, 500,000 movies, 3 million videos, TV shows and short films and 100 billion public Web pages.”
John Howard Huckans is a professor of physics at Bloomsburg University of Pennsylvania whose work in ultra-cold atomic physics has been published in many scientific journals. He lives with his family in Light Street, PA.