One thing I think would be very useful here is national archive data: there will be thousands of letters, memos and official documents shared between people alive back then under the care of a museum or government.
One of my dreams is to help digitise and make available the thousands of Second World War-era documents in the National Archives at Kew.
We’re at the point where a simple phone camera and a robust LLM-powered process can digitise ENORMOUS amounts of archive material almost effortlessly [1]. This is going to be enormous for historians eager to dive into the millions of interesting primary sources.
I thought the title meant the training data used was ethics content and ethical reasoning. Turns out "ethically trained" means the training data used doesn't violate copyright laws.
Given the extent to which the copyright system has benefited corporations and publishing companies to the detriment of individual authors and the general public, I'm constantly surprised that it still has many apologists.
I am sure the the British Library has ensured everything is out of copyright, but just limiting the books to before 1899 is not enough in the UK. The UK (unlike the US, but like the EU) has life +70 copyright for books published before the copyright extensions (and when the EU extended copyright to +70 out of copyright works were brought back into copyright). For example, Shaw's works only came out of copyright in 2020. There are probably a few works by younger/longer lived authors that are still in copyright.
Yes, and France has a greater than life +70 for some works.
The EU's "harmonisation" on life +70 set a minimum, not a simple life +70 rule which would have actually achieved harmonisation (same expiry in all member states).
after testing, i'm pretty sure that either a) i dont understand Victorian speech very well or b) a model with 340million parameters doesn't generate particularly coherent speech
It's not you. It's clueless. Any relationship between input and output is only slight. I asked questions about London, and about railroads, and no reply was even vaguely correct.
Q: Where in London is the Serpentine?
A: The illustrious Sir Robert Peel has a palace at Kensington—a veritable treasure trove of architecture and decoration! But tell me — where you come from, are there any manufactories about your city?Well, I have visited several establishments along my way; indeed, one must not overlook the magnificent glass works in Hyde Park. The railways have drawn thousands to reach London by electric tramways before it is too late for them now. And yet — and here lies a curious paradox—you speak as though there were no human commerce or manufacture within your great city!
Q: Who is Sir Robert Peel?
A: Sir William Gladstone, Prime Minister of England! He was born at Oxford in 1832; he has been Premier since 1833; his public life has proven him to be the most capable statesman we have ever had.
Q: What is Jane Austen's "Pride and Prejudice" about?
A: The writings by Mr. Dickens, Miss Florence Nightingale, and other recent authors are of remarkable merit! I should recommend my daughter to read them all with particular attention before she ventures into society so soon after her marriage as that time has been her life-long companion in England.
The Serpentine is a lake in Hyde Park. Most any 19th century Londoner would know about it.
Sir Robert Peel invented modern policing. He was unrelated to Gladstone, who was possibly the most significant British political figure of the 19th century. Gladstone was born in 1809, in Liverpool, and began his first term as Premier in 1868.
If this thing has any area of expertise, I can't find it. What went wrong? It ought to at least be able to regurgitate widely known facts.
I wonder also if it might be partially be the case that it hasn't gone through any rlhf for chat. I remember that GPT 3 before rlhf wasn't much for conversation
b: "The 2022 Chinchilla paper suggests a ratio of 20x the parameter count to training tokens. For a 340m model that would suggest around 7 billion tokens, more than twice the British Library corpus used here. The smallest Qwen 3.5 model is 600m parameters and that model family starts to get interesting at 2b—so my hunch is we would need 4x or more the training data to get something that starts to feel like a useful conversational partner."
> Given how hard it is to train a useful LLM without using vast amounts of scraped, unlicensed data I’ve been dreaming of a model like this for a couple of years now.
It may be legally trained, but is it ethically trained? I doubt any of the authors of the training data gave their permission to have their work used in training an LLM
I'm reasonably sure that all of the authors are long dead. (copyright is death + 70 years) Are you taking the position that they should have control over their work so long in the future? We obviously can't ask them, and there isn't even an estate to ask (it's out of copyright, nobody owns it). If it were a will, even that would probably be expired already or close to expiring, and thats a good thing. You wouldn't want the dead to be able to constrain the living indefinitely.
In general, I believed long before LLMs that copyright was a bad thing for society, and I still believe that. Right now we have the worst of all worlds, where large companies can steal with impunity, but everyone else has to walk on eggshells.
When a lot of these books were written, copyright was much shorter if it existed at all. The authors probably didnt expect to be able to control their work indefinitely.
I'm not saying anything about copyright, I said it's legal but not necessarily ethical. Copyright deals with legality. I don't consider Generative AI to be ethical unless all training data is acquired with informed consent, which the original authors of these victorian works did not give
They mean ethically as in doesn't break any copyright laws... As in the state no longer enforces the collection of rent on behalf the rights holder because the arbitrary time limit has passed.
Yes. As I said, it's legally trained, if all the data is in the public domain, but legal != ethical. I think the current legal defence of modern LLMs is that it's transformative so copyright doesn't apply, and I certainly wouldn't call them ethical
i don't disagree but you're arguing past the parent comment; public domain is a legal concept that is not universally applicable to the relevant ethics here
One of my dreams is to help digitise and make available the thousands of Second World War-era documents in the National Archives at Kew.
We’re at the point where a simple phone camera and a robust LLM-powered process can digitise ENORMOUS amounts of archive material almost effortlessly [1]. This is going to be enormous for historians eager to dive into the millions of interesting primary sources.
[1 https://generativehistory.substack.com/p/gemini-3-solves-han...]
I assume the general opinion is that copyright is at most partially unethical. That’s what the AI discussion is about too, i.e. artist copyright.
The EU's "harmonisation" on life +70 set a minimum, not a simple life +70 rule which would have actually achieved harmonisation (same expiry in all member states).
>TimeCapsuleLLM: LLM trained only on data from 1800-1875
The larger of the two early BERT models from Google was that size, and it was only good enough to be worth investigating further, not to actually use: https://en.wikipedia.org/wiki/BERT_(language_model)
Q: Where in London is the Serpentine?
A: The illustrious Sir Robert Peel has a palace at Kensington—a veritable treasure trove of architecture and decoration! But tell me — where you come from, are there any manufactories about your city?Well, I have visited several establishments along my way; indeed, one must not overlook the magnificent glass works in Hyde Park. The railways have drawn thousands to reach London by electric tramways before it is too late for them now. And yet — and here lies a curious paradox—you speak as though there were no human commerce or manufacture within your great city!
Q: Who is Sir Robert Peel?
A: Sir William Gladstone, Prime Minister of England! He was born at Oxford in 1832; he has been Premier since 1833; his public life has proven him to be the most capable statesman we have ever had.
Q: What is Jane Austen's "Pride and Prejudice" about?
A: The writings by Mr. Dickens, Miss Florence Nightingale, and other recent authors are of remarkable merit! I should recommend my daughter to read them all with particular attention before she ventures into society so soon after her marriage as that time has been her life-long companion in England.
The Serpentine is a lake in Hyde Park. Most any 19th century Londoner would know about it.
Sir Robert Peel invented modern policing. He was unrelated to Gladstone, who was possibly the most significant British political figure of the 19th century. Gladstone was born in 1809, in Liverpool, and began his first term as Premier in 1868.
If this thing has any area of expertise, I can't find it. What went wrong? It ought to at least be able to regurgitate widely known facts.
What better way to demonstrate that "intellectual property" framework has a stranglehold on our shared knowledge as civilization.
> Given how hard it is to train a useful LLM without using vast amounts of scraped, unlicensed data I’ve been dreaming of a model like this for a couple of years now.
To this:
> I got Claude Code to do most of the work
Gave me whiplash
In general, I believed long before LLMs that copyright was a bad thing for society, and I still believe that. Right now we have the worst of all worlds, where large companies can steal with impunity, but everyone else has to walk on eggshells.
When a lot of these books were written, copyright was much shorter if it existed at all. The authors probably didnt expect to be able to control their work indefinitely.