Vast Amounts of New Data from Books Being Made Available to AI Chatbot Programs like ChatGPT
- by Michael Stillman
A large source of additional information for AI (artificial intelligence) chatbot programs, like ChatGPT or Meta's Llama, has been opened. These are the online programs that answer just about any question you ask in seconds. They run on a type of software known as a “large language model,” which takes in vast amounts of data, uses it to learn the patterns of language well enough to make sense of that information, and then pulls out what it needs to answer your question. It is utterly amazing what they do, but they can't do it all by themselves. They know nothing but what they are fed, and if they are to draw on vast amounts of information, that information must come from somewhere.
Much of it comes from the internet, which means they must be smart enough to separate the wheat from the chaff, and “chaff” is an overly polite word for a lot of what is out there. In other words, they also need more reputable sources of information, and books and other publications are an important source for that. However, many (but not all) authors and publishers are not pleased with their work being used without payment. Authors, deservedly, get royalties for their books, but not when that work is copied and used by AI. They have sued to stop this practice, citing copyright law, as these works are copyrighted.
All of this is in the courts, and how it will be resolved is as yet unknown. However, a new source has recently emerged: books in libraries. Harvard University announced that it is making the vast dataset of books from its library available to AI models at no cost. Most of this was created almost two decades ago as part of the Google Books project, in which Google scanned and digitized millions of books at various libraries. Harvard compiled this and more as part of its Institutional Data Initiative at the Harvard Law School Library. Harvard has files for 386 million pages from almost one million books. It is now making them available for services like ChatGPT to learn from and use in answering your questions.
This will be helpful, particularly for understanding historic material, but there is one very major drawback. It is safe to use these books without risk of being sued because they are out of copyright. For older published works in the United States, the copyright term generally runs 95 years. Therefore, none of these books is less than 95 years old. This will not be much good for providing medical advice, even if it sometimes feels like this must be where RFK Jr. gets his medical recommendations. You want the latest opinions for medical diagnoses, and the same goes for other scientific knowledge. Good luck fixing your computer or car with advice that predates 1930, unless you have a Model T. Of course, these programs already have a lot of later information in place (some of which they are being sued to remove). It just means that these 386 million new pages won't add much to the answers you seek for these sorts of questions.
It should be noted that some information Harvard is providing is more recent since it is not subject to copyright. One example is legal case law. These court opinions are available to anyone to read – they need to be for legal experts to understand the law. This recent case law is being provided to the AI models that want to add it.
Update: A few days ago, the first court decision came down in a case of authors suing a chatbot developer for copyright violation. The authors lost.