Warsystems— The quest for the universal library

GOOGLE’S MOON SHOT

by JEFFREY TOOBIN
The quest for the universal library.
Issue of 2007-02-05
Posted 2007-01-29

Every weekday, a truck pulls up to the Cecil H. Green Library, on the campus
of Stanford University, and collects at least a thousand books, which
are taken to an undisclosed location and scanned, page by page, into an
enormous database being created by Google. The company is also
retrieving books from libraries at several other leading universities,
including Harvard and Oxford, as well as the New York Public Library.
At the University of Michigan, Google’s original partner in Google Book
Search, tens of thousands of books are processed each week on the
company’s custom-made scanning equipment.

Google intends
to scan every book ever published, and to make the full texts
searchable, in the same way that Web sites can be searched on the
company’s engine at google.com. At the books site, which is up and
running in a beta (or testing) version, at books.google.com, you can
enter a word or phrase—say, Ahab and whale—and the search returns a
list of works in which the terms appear, in this case nearly eight
hundred titles, including numerous editions of Herman Melville’s novel.
Clicking on “Moby-Dick, or The Whale” calls up Chapter 28, in which
Ahab is introduced. You can scroll through the chapter, search for
other terms that appear in the book, and compare it with other
editions. Google won’t say how many books are in its database, but the
site’s value as a research tool is apparent; on it you can find a
history of Urdu newspapers, an 1892 edition of Jane Austen’s letters,
several guides to writing haiku, and a Harvard alumni directory from
1919.

No one really knows how many books there are. The
most volumes listed in any catalogue is thirty-two million, the number
in WorldCat, a database of titles from more than twenty-five thousand
libraries around the world. Google aims to scan at least that many. “We
think that we can do it all inside of ten years,” Marissa Mayer, a
vice-president at Google who is in charge of the books project, said
recently, at the company’s headquarters, in Mountain View, California.
“It’s mind-boggling to me, how close it is. I think of Google Books as
our moon shot.”

Google’s is not the only book-scanning
venture. Amazon has digitized hundreds of thousands of the books it
sells, and allows users to search the texts; Carnegie Mellon is hosting
a project called the Universal Library, which so far has scanned nearly
a million and a half books; the Open Content Alliance, a consortium
that includes Microsoft, Yahoo, and several major libraries, is also
scanning thousands of books; and there are many smaller projects in
various stages of development. Still, only Google has embarked on a
project of a scale commensurate with its corporate philosophy: “to
organize the world’s information and make it universally accessible and
useful.”

In part because of that ambition, Google’s
endeavor is encountering opposition. A federal court in New York is
considering two challenges to the project, one brought by several
writers and the Authors Guild, the other by a group of publishers, who
are also, curiously, partners in Google Book Search. Both sets of
plaintiffs claim that the library component of the project violates
copyright law. Like most federal lawsuits, these cases appear likely to
be settled before they go to trial, and the terms of any such deal will
shape the future of digital books. Google, in an effort to put the
lawsuits behind it, may agree to pay the plaintiffs more than a court
would require; but, by doing so, the company would discourage potential
competitors. To put it another way, being taken to court and charged
with copyright infringement on a large scale might be the best thing
that ever happens to Google’s foray into the printed word.

Though
Google has more than ten thousand employees—about fifty new ones are
hired each week—and a market capitalization of more than a hundred and
fifty billion dollars, the company cultivates the air of a college
campus at its headquarters, in Silicon Valley. Now and then, there are
self-consciously wacky stunts, like Pajama Day, which happened to take
place when I visited. (The event was to be madcap within reason;
supervisors were told to convey the message that “pajamas means
‘pajamas,’ not ‘what you sleep in.’ ”) When I met with Sergey Brin, a
co-founder of Google, he was wearing bright-blue p.j.s, with the
company’s logo stitched on the breast pocket.

The story
of how Brin and Google’s other co-founder, Larry Page, met as graduate
students in computer science at Stanford in the mid-nineties, and
devised a series of elegant software algorithms that allowed Web
searchers to find relevant information quickly and efficiently, has
become part of Silicon Valley lore. Less well known is that, at the
time, Brin and Page were also working on Stanford’s Digital Library
Technologies Project, an attempt, funded by the federal government, to
organize different kinds of stored information, including books,
articles, and journals, in digital form. “There was an attitude in
computer science that putting things on dead trees was obsolete and
getting it all into a searchable, digital format was a quest that had
to be accomplished someday,” Terry Winograd, a Stanford professor who
was a mentor to Page and Brin, said.

After founding
Google, in 1998, Page and Brin—who are now in their mid-thirties and
worth around fourteen billion dollars each—began to talk about how to
include books in the company’s database. Page, in particular, embraced
the idea of putting books online; at one point, he set up a primitive
lab in his office, with a scanner and a page-turning machine. “I think
it was motivating to have those kinds of aspirations, but nobody really
took it seriously,” Brin told me. The men were less interested in
making it easy for people to obtain the full texts of books online than
in making accessible the information those books contained. “We really
care about the comprehensiveness of a search,” Brin said. “And
comprehensiveness isn’t just about, you know, total number of words or
bytes, or whatnot. But it’s about having the really high-quality
information. You have thousands of years of human knowledge, and
probably the highest-quality knowledge is captured in books. So not
having that—it’s just too big an omission.” As Marissa Mayer put it,
“Google has become known for providing access to all of the world’s
knowledge, and if we provide access to books we are going to get much
higher-quality and much more reliable information. We are moving up the
food chain.”

In 2002, Google quietly made overtures to
several libraries at major universities. The company proposed to
digitize the entire collection free of charge, and give the library an
electronic copy of each of its books. “Larry is an undergrad alum here
at Michigan, and he knew we were already interested in digitizing the
library as part of our preservation efforts,” John Wilkin, an associate
university librarian at Michigan, told me. “There was a lot of
back-and-forth between Google and us in the process. We wanted to
insure that the materials wouldn’t be damaged and that what came out
could be used as a preservation surrogate. They started experimenting
with different ways of copying the images, and we started a pilot
project in July, 2004. We’ve been getting better, going faster. We’re
doubling our output all the time.” The Michigan library holds seven
million volumes, and Wilkin believes that Google will have copied the
entire collection in about six years.

Last
month, at the New York Public Library, Google hosted a conference on
the future of the publishing industry. About four hundred people—mainly
publishing executives and agents—attended, most of them grimly aware of
the simultaneous lethargy and panic that have characterized their
industry’s response to the digital age. Nearly all attempts to sell
books in an electronic format have been disappointing, and now Google
appeared to be encroaching on the publishers’ domain. The implicit
message of the conference was summed up by a quotation from Charles
Darwin that was projected on a screen: “It is not the strongest of the
species that survive, nor the most intelligent, but the ones most
responsive to change.” As Laurence Kirschbaum, a longtime publishing
executive who recently became a literary agent, told me at the
conference, “Google is now the gatekeeper. They are reaching an
audience that we as publishers and authors are not reaching. It makes
perfect sense to use the specificity of a search engine as a tool for
selling books.”

Google thought so, too, and designed the
books project accordingly. In addition to forming partnerships with
libraries, the company has signed contracts with nearly every major
American publisher. When one of these publishers’ books is called up in
response to search queries, Google displays a portion of the total work
and shows links to the publisher’s Web site and online shops like
Amazon, where users can buy the book. “We are helping the publishers
reach consumers that otherwise might not have known about their books
and helping them market their books by giving limited but relevant
previews of the books,” Jim Gerber, Google’s director of content
partnerships, told me. “The Internet and search are custom made for
marketing books. When there are a hundred and seventy-five thousand new
books each year, you can’t market each one of those books in mass
market. When someone goes into a search engine to learn more about a
topic, that is a perfect time to make them aware that a given book
exists. Publishers know that ‘browse leads to buy.’ ” (Google says that
it does not take a cut of sales made through its books site.)

Still,
on October 19, 2005, several leading publishers, including Simon &
Schuster, the Penguin Group, and McGraw Hill—all of which are partners
in Google Book Search—filed a lawsuit against the company, seeking to
stop the project. The publishers don’t object to Google’s plan for
helping them sell new books, but they assert that the library component
of the project is illegal. They claim that Google’s “massive, wholesale
and systematic copying of entire books still protected by copyright”
infringes on the publishers’ rights. They demand that Google stop
further copying and “destroy all unauthorized copies made by Google
through the Google Library Project of any copyrighted works.” (The
Authors Guild filed its lawsuit around the same time.) The publishers,
who have the support of the Association of American Publishers, are
suffering from a version of the problem that John Kerry had in the last
Presidential campaign: they are for Google Book Search at the same time
that they are against it.

Copyright
law dates to the birth of the Republic. Article I of the Constitution
assigns Congress the right to pass laws “securing for limited Times to
Authors and Inventors the exclusive Right to their respective Writings
and Discoveries.” The first copyright law was passed in 1790, and it
has been frequently and confusingly amended over the years, most
recently in the Sonny Bono Copyright Term Extension Act of 1998, which
extended copyright terms by twenty years. (The law is also known as the
Mickey Mouse Protection Act, because the Walt Disney Company, seeking
to protect its copyright on early animated classics like “Steamboat
Willie,” lobbied heavily for it.) The twisted history of copyright law
has insured an awkward passage into the digital age.

The
legal assertion at the core of Google’s business plan is its purported
right to scan millions of copyrighted books without payment to or
permission from the copyright owners. Approximately twenty per cent of
all books are in the public domain; these include books that were never
copyrighted, like government publications, and works whose copyrights
have expired, like “Moby-Dick.” Google has simply copied such books and
made them available on the Web. Roughly ten per cent of books are
copyrighted and in print—that is, actively being sold by publishers.
Many of these books are covered by Google’s arrangement with its
publisher partners, which allows the company to scan and display parts
of the works.

The vast majority of books belong to a third
category: still protected by copyright, or of uncertain status, and out
of print. These books are at the center of the conflict between Google
and the publishers. Google is scanning these books in full but making
only “snippets” (the company’s term) available on the Web. (Google
searches turn up only the search term and about twenty words on either
side of it.) Copyright law has never forbidden all “copying” of a
protected work; scholars and journalists have long been allowed to
quote portions of copyrighted material under the doctrine of fair use.
Google maintains that the chunks of copyrighted material that it makes
available on its books site are legal under fair use. “We really
analogized book search to Web search, and we rely on fair use every day
on Web search,” David C. Drummond, a senior vice-president at Google
who is overseeing the response to the lawsuits, told me. “Web sites
that we crawl are copyrighted. People expect their Web sites to be
found, and Google searches find them. So, by scanning books, we give
books the chance to be found, too.” (Google also has an “opt out”
policy, which allows copyright holders to request that specific titles
be omitted from the company’s database.)

However,
according to the plaintiffs in the cases against Google, the act of
copying the complete text amounts to an infringement, even if only
portions are made available to users. “What they are doing, of course,
is scanning literally millions of copyrighted books without
permission,” Paul Aiken, the executive director of the Authors Guild,
said. “Google is doing something that is likely to be very profitable
for them, and they should pay for it. It’s not enough to say that it
will help the sales of some books. If you make a movie of a book, that
may spur sales, but that doesn’t mean you don’t license the books.
Google should pay. We should be finding ways to increase the value of
the stuff on the Internet, but Google is saying the value of the right
to put books up there is zero.”

Google asserts that its
use of the copyrighted books is “transformative,” that its database
turns a book into essentially a new product. “A key part of the line
between what’s fair use and what’s not is transformation,” Drummond
said. “Yes, we’re making a copy when we digitize. But surely the
ability to find something because a term appears in a book is not the
same thing as reading the book. That’s why Google Books is a different
product from the book itself.” In other words, Google says that being
able to search books on its site—which it describes as the equivalent
of a giant library card catalogue—is not the same as making the books
themselves available. But the publishers cite another factor in
fair-use analysis: the amount of the copyrighted work that is used in
the creation of the new one. Google is copying entire books, which
doesn’t sound “fair” to the plaintiff publishers and authors.
“Traditional copyright analysis says that a transformation leads to the
creation of a new and independent work, like a parody or a work of
criticism,” Jane Ginsburg, a professor at Columbia Law School, said.
“Copying the entire work, which is what Google is doing, does not
preclude a finding of fair use, but it does fall outside the
traditional paradigm.”

Harvard, Stanford, and Oxford
have prohibited Google from scanning copyrighted works in their
collections, limiting the company to books that are in the public
domain. Because of the opacity of copyright law, and the extension of
protections mandated by the 1998 act, it’s not always clear which works
are still protected. (Copyright status can become murky when authors
die or publishing houses go out of business.) Stanford has drawn a line
at 1964 and prohibited Google from copying most works published since
that date. “When Google got sued, we got nervous,” Michael A. Keller,
the university librarian at Stanford, told me. “We’re not a public
institution. We don’t have any state immunity from being sued
ourselves, so we started sorting out the stuff that we know is public
domain.” (Several of the public institutions that are Google’s
partners, including the Universities of Michigan, California, Virginia,
and Texas at Austin, are allowing the scanning of copyrighted material.)

The
chief engineer of Google’s system for scanning books in the library
collections is Dan Clancy, who joined the company after eight years at NASA,
where he supervised teams of Ph.D.s. working on problems related to
artificial intelligence. Google provides its employees with free food
twenty-four hours a day, and Clancy, a tall, shambling man with a shock
of white-blond hair, conducted most of our conversations with bits of
granola bar clinging to his shirt.

“Previously, when
people have done scanning, they always were constrained by their budget
and their scale,” Clancy told me. “They had to spend all this time
figuring out which were the perfect ten thousand books, so they spent
as much time in selection as in scanning. All the technology out there
developed solutions for what I’ll call low-rate scanning. There was no
need for a company to build a machine that could scan thirty million
books. Doing this project just using commercial, off-the-shelf
technology was not feasible. So we had to build it ourselves.”

Google
will not discuss its proprietary scanning technology, but, rather than
investing in page-turning equipment, the company employs people to
operate the machines, I was told by someone familiar with the process.
“Automatic page-turners are optimized for a normal book, but there is
no such thing as a normal book,” Clancy said. “There is a great deal of
variability over books in a library, in terms of size or dust or
brittle pages.” (To needle Google, several blogs have posted images
from the books site that include the scanners’ fingers.) Google will
not reveal how much it is spending on the books project. In 2005,
Microsoft announced that it would spend two and a half million dollars
to scan a hundred thousand out-of-copyright books in the collection of
the British Library. At this rate, scanning thirty-two million
books—the number in WorldCat’s database—would cost Google eight hundred
million dollars, a major but hardly extravagant expenditure for a
multibillion-dollar corporation.

Copying all those pages
presents many difficulties, but writing software to make the books
useful to searchers is even harder. “The scanning technology is
boring,” Clancy said. “The real challenge is to get somebody something
that they are actually interested in, inside a book. Web sites are part
of a network, and that’s a significant part of how we rank sites in our
search—how much other sites refer to the others.” But, he added, “Books
are not part of a network. There is a huge research challenge, to
understand the relationship between books.”

Still, the
basic search protocols function well. A search for “Heart of Darkness”
leads immediately to Joseph Conrad’s novel, which is not as obvious as
it sounds, considering how common the words in the title are. As Clancy
said, “If you put in ‘Heart of Darkness,’ we have to know that you’re
looking for the novel, not a book about lighting conditions in cardiac
surgery. So how do we do that? We rank some words more important than
others. The title may matter more than the content, so we may weight
that more. You could also look at what other people have searched for,
so if everyone who searched for ‘Heart of Darkness’ clicked on the
novel, we might figure that you probably will, too.”

The
most important data for ranking searches, Clancy explained, may come
from Web pages that link to books in Google’s database. (For instance,
if links on the phrase “Clinton’s autobiography” direct users to a copy
of “My Life” on the books site, there is a high probability that people
who use the same search terms will also want this result.) “We just
started, and we need to make these books networked, and we need people
to help us do that,” Clancy said.

Google’s database
contains many books in languages other than English, but for now they
must be searched in the original tongue. On the company’s Web site,
there is already a primitive translation feature, and it may someday be
enhanced to allow books to be rendered in another language at the touch
of a button. “In terms of democratization, you want to be able to
access information,” Clancy told me. In places like the Arab world,
where few titles are translated into the local languages each year, he
said, access to the world’s books could have a substantial impact. “We
are talking about a universal digital library,” Clancy went on. “I hope
this world evolves so that there exists a time where somebody sitting
at a terminal can access all the world’s information.”

Such
messianism cannot obscure the central truth about Google Book Search:
it is a business. Google has pledged not to show advertising next to
the pages of library books, but the company does sell advertising
alongside search results that lead to books obtained from publishers.
Google’s prospects for producing revenue from the books project appear
rather modest, but the company has often made a profit on ventures that
initially seemed unlikely to be lucrative. “We’ve had this fortunate
streak that when we’ve done things that have impacted our users and
society as a whole—positively, in a significant way—we’ve been rewarded
by that downstream in some way, even though we may not have envisioned
exactly what it was right offhand,” Sergey Brin told me. “We didn’t
have ads when we first put up Web search. It wasn’t clear it was great
business when we started search. In fact, the companies that were doing
search were moving away from it. But we just thought it was important,
and we thought that where there was a will there would be a way. And in
fact it turned out to be a great way to make money—doing search with
targeted advertising. And I think you’ll find the same sort of thing
here.”

The key legal question is whether the courts will allow
Google to continue to scan copyrighted material without permission. But
the schedule of the lawsuits may turn out to be as significant as the
merits of the cases, which are before Judge John E. Sprizzo. In keeping
with the stately pace of federal litigation, the depositions of
witnesses are to begin sometime this year, and the parties will be
allowed to file motions for summary judgment—in Google’s case, to
dismiss the suits—in early 2008. Then there could be a trial. If the
cases are appealed, they could linger well into the next decade.

However,
most people involved in the dispute believe that a settlement is
likely. “The suits that have been filed are a business negotiation that
happens to be going on in the courts,” Marissa Mayer told me. “We think
of it as a business negotiation that has a large legal-system component
to it.” According to Pat Schroeder, the former congresswoman, who is
the president of the Association of American Publishers, “This is
basically a business deal. Let’s find a way to work this out. It can be
done. Google can license these rights, go to the rights holder of these
books, and make a deal.”

The terms of such a deal aren’t
hard to imagine. The Authors Guild is concerned that pirated copies of
the books on Google’s site could leak to the public, and so the
organization would insist on security measures. (Sadly, for writers and
publishers, demand for their products has never been robust enough to
generate a major piracy problem.) As for distribution of the proceeds
from the site, Google might agree to share revenue with publishers, in
the way that radio stations pay for the music they play; publishers
could receive a fee based on a statistical analysis of how often their
books are viewed. Google could pay in cash or in kind, with
advertising.

But a settlement that serves the parties’
interests does not necessarily benefit the public. “It’s clearly in
both sides’ interest to settle,” Lawrence Lessig, a professor at
Stanford Law School, said. “Businesses in Internet time can’t wait
around for years for lawsuits to be resolved. Google wants to be able
to get this done, and get permission to resume scanning copyrighted
material at all the libraries. For the publishers, if Google gives them
anything at all, it creates a practical precedent, if not a legal
precedent, that no one has the right to scan this material without
their consent. That’s a win for them. The problem is that even though a
settlement would be good for Google and good for the publishers, it
would be bad for everyone else.”

Libraries
have recognized for some time that they must adapt to the digital age,
and many have taken steps in that direction. In 1995, Stanford founded
the HighWire Press, which now provides electronic access to more than a
thousand scholarly journals. A few years later, Stanford digitized most
of its card catalogue, and circulation of its books increased by fifty
per cent. “Once our students could sit in their dorm rooms and find out
what we had in the library, they sought out more books,” Michael
Keller, the university librarian, says. Individual libraries sometimes
received grants to scan specific collections—in 2001, the New York
Public Library used federal money to digitize a substantial portion of
the collection at its Schomburg Center for Research in Black
Culture—but a comprehensive effort seemed inconceivable. According to
Paul LeClerc, who has been the president of the New York Public Library
for the past thirteen years, “For the first decade of my tenure, I was
always asked, ‘Weren’t libraries going to go online?’ And I’d say of
course we want to do it, but it’s not going to happen, because no one
is going to give us the money to do it. Nowhere on the horizon was that
amount of money predictable or identifiable. Then came Google. This
struck us as being the quickest, the fastest, and the most efficient
way of getting large-scale additions to our collections online for free
use.”

Among Google’s potential competitors in the field
of library digitization are members of the Open Content Alliance, which
facilitates various scanning projects around the country and overseas.
Funded largely by Microsoft and the Alfred P. Sloan Foundation, the
O.C.A. has formed alliances with many companies and institutions,
including the Boston Public Library, the American Museum of Natural
History, and Johns Hopkins University. For the moment, though, the
O.C.A.’s members are copying only material in the public domain (and
works from copyright owners who have given explicit permission), which
limits the scope of the projects substantially.

Google’s
advantage may well be cemented if the company settles its lawsuits with
the publishers and authors. “If Google says to the publishers, ‘We’ll
pay,’ that means that everyone else who wants to get into this business
will have to say, ‘We’ll pay,’ ” Lessig said. “The publishers will get
more than the law entitles them to, because Google needs to get this
case behind it. And the settlement will create a huge barrier for any
new entrants in this field.”

In other words, a settlement
could insulate Google from competitors, which would be especially
troubling, because the company has already proved that when it comes to
searches it is not infallible. “Google didn’t get video search
right—YouTube did,” Tim Wu, a professor at Columbia Law School, said.
(Google solved that problem by buying YouTube last year for $1.6
billion.) “Google didn’t get blog search right—technorati.com did,” Wu
went on. “So maybe Google won’t get book search right. But if they
settle the case with the publishers and create huge barriers to
newcomers in the market there won’t be any competition. That’s the
greatest danger here.”

The
most striking thing about Pajama Day at Google was how few people
participated. Most of the rank and file saw the stunt for the
manufactured fun that it was. They came to work in their usual slacker
uniforms of jeans and T-shirts—which are, in their way, as conformist
as white shirts and ties were at I.B.M. in the nineteen-sixties.
Google, as its employees seem to recognize, cannot pretend to be
anything other than a large and powerful corporation.

It’s
easy to mock Google’s unofficial motto—“Don’t be evil”—but there is
nothing evil about Google Book Search. At the same time, there is
nothing inherently virtuous about it. Google has succeeded because, on
the whole, it has developed excellent products; it’s folly to judge the
company’s behavior on moral grounds. Its shareholders certainly don’t.

Nor
can publishers and authors, who are struggling for a way to survive in
a new age, portray their conflict with the company as one between good
and evil. The dual status of several leading publishers as both partner
and adversary to Google underscores their desperate need to hedge their
bets in a digital world that they have yet to master. The publishers’
complaint against Google states that “the Publishers support making
books available in digital form so that those books can be, among other
things, researched through electronic means.” That may be true in
theory, but trade publishers, in particular, have been slow to embrace
new technology, especially for out-of-print books; Google will almost
certainly bring more attention to these works than their own publishers
have.

The law is supposed to resolve issues like
these—between self-interested parties with reasonable claims and
legitimate arguments. But the rules of copyright are so ambiguous, and
the courts so slow, that the judicial system serves largely to
implement the law of the jungle. “There is a real opportunity to move
books into the digital arena,” Marissa Mayer told publishers during the
conference at the New York Public Library. “And we are going to do it
together.”