Where Do AI Programs Get Their Data? It Turns Out Some Comes from Copyrighted Books, Without Permission
- by Michael Stillman
CatGPT?
Where does the information you get from artificial intelligence (AI) sources like ChatGPT come from? It comes from a lot of places, including the reams of data on the internet, but a significant source is books. Many, if not most, are of recent vintage, as up-to-date information is needed for the best answers. As such, most of these books are under copyright. However, the authors and publishers of these books have been neither asked for permission nor compensated. Is this legal, an acceptable use of copyrighted works, or a violation of copyright law? Good question. No one knows the answer since it has not been adjudicated in court.
AI programs gain a lot of their data, and learn how language is used so they can give understandable answers, from training databases. These are databases filled with an enormous amount of information. How about the best known AI program, ChatGPT? Did it learn from a training database? To answer this, we went to the ultimate authority to ask, ChatGPT itself. It responded, “Yes, ChatGPT, like other GPT-3 models, is trained on a large and diverse dataset containing a wide range of text from the internet. This dataset includes books, articles, websites, and other sources of human-generated text. The model learns patterns, language structures, and information from this training data, which it then uses to generate responses to user inputs.”
One such online training database is called “The Pile,” and a subset of The Pile is Books3. The Pile contains data from numerous sources, with Books3 providing the book element. Books3 contains 196,000 books, converted to searchable text. The text is not necessarily in a format that would allow you to read it as a book, but it is there. Most of the books are likely copyrighted but used without permission. Books3 was freely available on the internet to anyone seeking to build an AI model. Its creator made it so, as he wanted even small developers to have a shot at creating a model.
Books3 was recently removed from the internet. It was taken down after Rights Alliance, a group representing Danish publishers, made the request. They determined that 150 titles used were published by their members. The Eye, the website hosting Books3, complied.
This issue is already starting to appear in court, and we can expect to see more of this until some sort of decision is reached on where AI training databases and copyright law intersect. It is argued that this is “Fair Use,” a doctrine that allows you to quote brief parts of a book without running afoul of copyright law. Training an AI model can be argued to be similar, without even any direct quoting. It is sort of like conducting research in a library. However, it is also true these databases have copied entire books to do their searching. It is also notable that the authors are not being compensated, while being at risk of losing sales to people who would rather do their research through services like ChatGPT. Of course, the database compiler can license the material from the publisher, but that would require many deals with many people, and it might be prohibitively expensive for all but the largest corporations. That is what the Books3 founder sought to avoid. Maybe ChatGPT can come up with an answer to this dilemma.
Note on illustration. What the...? I asked ChatGPT's image generator for a picture of ChatGPT. This is what it gave me. Why? Who knows. Perhaps it has to do with the French word for “cat” being “chat,” but who knows what its artificial mind was thinking. Hopefully, its textual answers are a little better.