Where Do AI Programs Get Their Data? It Turns Out Some Comes from Copyrighted Books, Without Permission
- by Michael Stillman
CatGPT?
Where does the information you get from artificial intelligence (AI) sources like ChatGPT come from? It comes from a lot of places, including the reams of data on the internet, but a significant source is books. Many, if not most, are of recent vintage as up-to-date information is needed for best answers. As such, most of these books are under copyright. However, the authors and publishers of these books have been neither asked for permission nor compensated. Is this legal, an acceptable use of copyrighted works, or a violation of copyright law? Good question. No one knows the answer since it has not been adjudicated in court.
AI programs gain a lot of their data, and learn how language is used so they can give understandable answers, from training databases. These are databases filled with an enormous amount of information. How about the best known AI program, ChatGPT? Did it learn from a training database? To answer this, we went to the ultimate authority to ask, ChatGPT itself. It responded, “Yes, ChatGPT, like other GPT-3 models, is trained on a large and diverse dataset containing a wide range of text from the internet. This dataset includes books, articles, websites, and other sources of human-generated text. The model learns patterns, language structures, and information from this training data, which it then uses to generate responses to user inputs.”
One such online training database is called “The Pile,” and a subset of The Pile is Books3. The Pile contains data from numerous sources, with Books3 providing the book element. Books3 contains 196,000 books, converted to searchable text. It is not necessarily in a format that would allow you to read it as a book, but the text is there. Most are likely copyrighted but used without permission. It was freely available on the internet to anyone seeking to build an AI model. Its creator made it so, as he wanted even small developers to have a shot at creating a model.
Books3 was recently removed from the internet. It was taken down after Rights Alliance, a group representing Danish publishers, made the request. They determined that 150 titles used were published by their members. The Eye, the website hosting Books3, complied.
This issue is already starting to appear in court, and we can expect to see more of this until some sort of decision is reached on where AI training databases and copyright law intersect. It is argued that this is “Fair Use,” a doctrine that allows you to quote brief parts of a book without running afoul of copyright law. The use here can be argued to be similar, and it does not even involve direct quoting. It is sort of like conducting research in a library. However, it is also true these databases have copied entire books to do their searching. It is also notable that the authors are not being compensated, while at risk of losing sales to people who would rather do their research through services like ChatGPT. Of course, the database compiler can license the material from the publisher, but that would require many deals with many people, and it might be prohibitively expensive for all but the largest corporations. That is what the Books3 founder sought to avoid. Maybe ChatGPT can come up with an answer to this dilemma.
Note on illustration. What the...? I asked ChatGPT's image generator for a picture of ChatGPT. This is what it gave me. Why? Who knows. Perhaps it has to do with the French word for “cat” being “chat,” but who knows what its artificial mind was thinking. Hopefully, its textual answers are a little better.