Vast Amounts of New Data from Books Being Made Available to AI Chatbot Programs like ChatGPT
- by Michael Stillman
A large source of additional information for AI (artificial intelligence) chatbot programs, like ChatGPT or Meta's Llama, has been opened. Those are the online programs that answer just about every question you ask in seconds. A type of software known as a “Large Language Model” is able to take vast amounts of data, use it to familiarize itself with manners of speech so as to understand this vast database of information, and then pull out what it needs to answer your question. It is utterly amazing what these programs do, but they can't do it all by themselves. They know nothing but what they are fed, and if they are to draw on vast amounts of information, that information must come from somewhere.
Much of it comes from the internet, which means they must be smart enough to separate the wheat from the chaff, and “chaff” is an overly polite word for a lot of what is out there. In other words, they also need some more reputable sources of information, and books and other publications are an important source for that. However, many (but not all) authors and publishers are not pleased with their work being used without payment. Authors, deservedly, get royalties for their work in books, but not for their work when it is copied and used by AI. They have sued to stop this practice, citing copyright law, as these works are copyrighted.
All of this is in the courts and how it will be resolved is as yet unknown. However, a new source has emerged lately: books in libraries. Harvard University announced that it is making its vast dataset of books from its library available to AI models at no cost. Most of this was created almost two decades ago as part of the Google Books project, where Google scanned and digitized millions of books at various libraries. Harvard compiled this and more as part of its Institutional Data Initiative at the Harvard Law School Library. Harvard has files for 386 million pages from almost one million books. It is now making them available for services like ChatGPT to learn from and find answers to your questions.
This will be helpful, particularly for understanding historic material, but there is one very major drawback. It is safe to use these books without risk of being sued because they are out of copyright. For books of that era, the U.S. copyright term is 95 years from publication. Therefore, none of these books is less than 95 years old. This will not be much good for providing medical advice, even if it sometimes feels like this must be where RFK Jr. gets his medical recommendations. You want the latest opinions for medical diagnoses and the same for other scientific knowledge. Good luck fixing your computer or car with advice that predates 1930, unless you have a Model T. Of course, these programs already have a lot of later information in place (some of which they are being sued to remove). It just means that these 386 million new pages won't add much to the answers you seek for these sorts of questions.
It should be noted that some information Harvard is providing is more recent since it is not subject to copyright. One example is legal case law. These court opinions are available to anyone to read – they need to be for legal experts to understand the law. This recent case law is being provided to the AI models that want to add it.
Update: A few days ago, the first court decision came down in a case of authors suing a chatbot company for copyright violation. The authors lost.