Where Do AI Programs Get Their Data? It Turns Out Some Comes from Copyrighted Books, Without Permission
- by Michael Stillman
CatGPT?
Where does the information you get from artificial intelligence (AI) sources like ChatGPT come from? It comes from a lot of places, including the reams of data on the internet, but a significant source is books. Many, if not most, are of recent vintage, as up-to-date information is needed for the best answers. As such, most of these books are under copyright. However, the authors and publishers of these books have been neither asked for permission nor compensated. Is this legal, an acceptable use of copyrighted works, or a violation of copyright law? Good question. No one knows the answer since it has not been adjudicated in court.
AI programs gain a lot of their data, and learn how language is used so they can give understandable answers, from training databases. These are databases filled with an enormous amount of information. How about the best known AI program, ChatGPT? Did it learn from a training database? To answer this, we went to the ultimate authority to ask, ChatGPT itself. It responded, “Yes, ChatGPT, like other GPT-3 models, is trained on a large and diverse dataset containing a wide range of text from the internet. This dataset includes books, articles, websites, and other sources of human-generated text. The model learns patterns, language structures, and information from this training data, which it then uses to generate responses to user inputs.”
One such online training database is called “The Pile,” and a subset of The Pile is Books3. The Pile contains data from numerous sources, with Books3 providing the book element. Books3 contains 196,000 books, converted to searchable text. The text is not necessarily in a format that would allow you to read it as a book, but it is there. Most of the books are likely copyrighted but used without permission. Books3 was freely available on the internet to anyone seeking to build an AI model. Its creator made it so, as he wanted even small developers to have a shot at creating a model.
Books3 was recently removed from the internet. It was taken down after Rights Alliance, a group representing Danish publishers, made the request. They determined that 150 titles used were published by their members. The Eye, the website hosting Books3, complied.
This issue is already starting to appear in court, and we can expect to see more of it until some sort of decision is reached on where AI training databases and copyright law intersect. Defenders argue that this use is “Fair Use,” a doctrine that allows you to quote brief parts of a book without running afoul of copyright law. Training an AI model on a book can be argued to be similar, and it does not even involve direct quoting. It is sort of like conducting research in a library. However, it is also true that these databases have copied entire books to do their searching. It is also notable that the authors are not being compensated, while they are at risk of losing sales to people who would rather do their research through services like ChatGPT. Of course, the database compiler can license the material from the publisher, but that would require many deals with many people, and it might be prohibitively expensive for all but the largest corporations. That is what the Books3 creator sought to avoid. Maybe ChatGPT can come up with an answer to this dilemma.
Note on illustration. What the...? I asked ChatGPT's image generator for a picture of ChatGPT. This is what it gave me. Why? Who knows. Perhaps it has to do with the French word for “cat” being “chat,” but who knows what its artificial mind was thinking. Hopefully, its textual answers are a little better.