Rare Book Monthly

Articles - September - 2023 Issue

Where Do AI Programs Get Their Data? It Turns Out Some Comes from Copyrighted Books, Without Permission

CatGPT?


Where does the information you get from artificial intelligence (AI) sources like ChatGPT come from? It comes from a lot of places, including the reams of data on the internet, but a significant source is books. Many, if not most, are of recent vintage, as up-to-date information is needed for the best answers. As such, most of these books are under copyright. However, the authors and publishers of these books have been neither asked for permission nor compensated. Is this legal, an acceptable use of copyrighted works, or a violation of copyright law? Good question. No one knows the answer since it has not been adjudicated in court.

AI programs gain a lot of their data, and learn how language is used so they can give understandable answers, from training databases. These are databases filled with an enormous amount of information. How about the best-known AI program, ChatGPT? Did it learn from a training database? To answer this, we asked the ultimate authority, ChatGPT itself. It responded, “Yes, ChatGPT, like other GPT-3 models, is trained on a large and diverse dataset containing a wide range of text from the internet. This dataset includes books, articles, websites, and other sources of human-generated text. The model learns patterns, language structures, and information from this training data, which it then uses to generate responses to user inputs.”

One such online training database is called “The Pile,” and a subset of The Pile is Books3. The Pile contains data from numerous sources, with Books3 providing the book element. Books3 contains 196,000 books, converted to searchable text. The text is not necessarily in a format that would allow you to read it as a book, but it is there. Most of the books are likely copyrighted but used without permission. Books3 was freely available on the internet to anyone seeking to build an AI model. Its creator made it so, as he wanted even small developers to have a shot at creating a model.

Books3 was recently removed from the internet. It was taken down at the request of Rights Alliance, a group representing Danish publishers, which determined that 150 of the titles used were published by its members. The Eye, the website hosting Books3, complied.

This issue is already starting to appear in court, and we can expect to see more of it until some sort of decision is reached on where AI training databases and copyright law intersect. Some argue that this use is “Fair Use,” a doctrine that allows you to quote brief parts of a book without running afoul of copyright law. Training an AI model can be argued to be similar, and it does not even involve direct quoting. It is sort of like conducting research in a library. However, it is also true that these databases have copied entire books to do their searching. It is also notable that the authors are not being compensated, while they are at risk of losing sales to people who would rather do their research through services like ChatGPT. Of course, the database compiler can license the material from the publisher, but that would require many deals with many people, and it might be prohibitively expensive for all but the largest corporations. That is what the Books3 founder sought to avoid. Maybe ChatGPT can come up with an answer to this dilemma.

Note on illustration. What the...? I asked ChatGPT's image generator for a picture of ChatGPT. This is what it gave me. Why? Who knows. Perhaps it has to do with the French word for “cat” being “chat,” but who knows what its artificial mind was thinking. Hopefully, its textual answers are a little better.


Posted On: 2023-09-01 12:11
User Name: PeterReynolds

Textual answers better? Not in my experience. I asked it for the chapter titles of a book which it knew how to find online, formatted as a numbered list. It would only give me a list of chapters that it felt ought to be in books of this type, not the ones in the particular book, despite being able to point me to where I could find and read the book online.

