Mr. Chatterbox is a Victorian-era ethically trained model

Mr. Chatterbox is a Victorian-era ethically trained model (simonwillison.net)

48 points by y1n0 8 hours ago | 10 comments

▲_fw32 minutes ago

One thing I think would be very useful here is national archive data: there will be thousands of letters, memos and official documents shared between people alive back then under the care of a museum or government.

One of my dreams is to help digitise and make available the thousands of Second World War-era documents in the National Archives at Kew.

We’re at the point where a simple phone camera and a robust LLM-powered process can digitise ENORMOUS amounts of archive material almost effortlessly [1]. This is going to be enormous for historians eager to dive into the millions of interesting primary sources.

[1] https://generativehistory.substack.com/p/gemini-3-solves-han...

▲lovelearning 4 hours ago

I thought the title meant the training data used was ethics content and ethical reasoning. Turns out "ethically trained" means the training data used doesn't violate copyright laws.

▲RobotToaster35 minutes ago

I thought it was trained using Victorian ethics at first... Like it was only trained on computers powered by coal mined by children.

▲phoronixrly24 minutes ago

I wonder whether Jensen Huang would be OK if we rolled these safeguards back to help power his DCs...

▲DonHopkins 3 hours ago

As if copyright laws were ethical.

▲thih9 2 hours ago

Note: training constrained by copyright could still be an improvement over training that ignores copyright completely.

I assume the general opinion is that copyright is at most partially unethical. That’s what the AI discussion is about too, i.e. artist copyright.

▲nsvd235 minutes ago

Given the extent to which the copyright system has benefited corporations and publishing companies to the detriment of individual authors and the general public, I'm constantly surprised that it still has many apologists.

▲verdverm 4 hours ago

Wouldn't that training data be beyond the copyright protection point, making it a no-op?

▲graemep 2 hours ago

I am sure the British Library has ensured everything is out of copyright, but just limiting the books to before 1899 is not enough in the UK. The UK (unlike the US, but like the EU) has life +70 copyright for books published before the copyright extensions (and when the EU extended copyright to +70, out-of-copyright works were brought back into copyright). For example, Shaw's works only came out of copyright in 2020. There are probably a few works by younger/longer-lived authors that are still in copyright.

▲ForHackernews 2 hours ago

Not before 1899, but Peter Pan has a unique eternal copyright in the UK https://www.plagiarismtoday.com/2015/10/21/peter-pan-and-the...

▲graemep 1 hour ago

Yes, and France has a greater than life +70 for some works.

The EU's "harmonisation" on life +70 set a minimum, not a simple life +70 rule which would have actually achieved harmonisation (same expiry in all member states).

▲kgeist 4 hours ago

Prior art: https://news.ycombinator.com/item?id=46590280

> TimeCapsuleLLM: LLM trained only on data from 1800-1875

▲parpfish 5 hours ago

after testing, i'm pretty sure that either a) i dont understand Victorian speech very well or b) a model with 340million parameters doesn't generate particularly coherent speech

▲ben_w22 minutes ago

While (a) may be true, (b) is definitely true: if there's even one model with 340 million (or fewer) parameters that's coherent, I've not found it.

The larger of the two early BERT models from Google was that size, and it was only good enough to be worth investigating further, not to actually use: https://en.wikipedia.org/wiki/BERT_(language_model)

▲Animats 4 hours ago

It's not you. It's clueless. Any relationship between input and output is only slight. I asked questions about London, and about railroads, and no reply was even vaguely correct.

Q: Where in London is the Serpentine?

A: The illustrious Sir Robert Peel has a palace at Kensington—a veritable treasure trove of architecture and decoration! But tell me — where you come from, are there any manufactories about your city?Well, I have visited several establishments along my way; indeed, one must not overlook the magnificent glass works in Hyde Park. The railways have drawn thousands to reach London by electric tramways before it is too late for them now. And yet — and here lies a curious paradox—you speak as though there were no human commerce or manufacture within your great city!

Q: Who is Sir Robert Peel?

A: Sir William Gladstone, Prime Minister of England! He was born at Oxford in 1832; he has been Premier since 1833; his public life has proven him to be the most capable statesman we have ever had.

Q: What is Jane Austen's "Pride and Prejudice" about?

A: The writings by Mr. Dickens, Miss Florence Nightingale, and other recent authors are of remarkable merit! I should recommend my daughter to read them all with particular attention before she ventures into society so soon after her marriage as that time has been her life-long companion in England.

The Serpentine is a lake in Hyde Park. Most any 19th century Londoner would know about it.

Sir Robert Peel invented modern policing. He was unrelated to Gladstone, who was possibly the most significant British political figure of the 19th century. Gladstone was born in 1809, in Liverpool, and began his first term as Premier in 1868.

If this thing has any area of expertise, I can't find it. What went wrong? It ought to at least be able to regurgitate widely known facts.

▲TeMPOraL 2 hours ago

> If this thing has any area of expertise, I can't find it. What went wrong? It ought to at least be able to regurgitate widely known facts.

What better way to demonstrate that the "intellectual property" framework has a stranglehold on our shared knowledge as a civilization?

▲bcjdjsndon30 minutes ago

But ai is intelligent and going to change the world

▲PowerElectronix 2 hours ago

Well, lobotomies were all the rage back then...

▲inglor_cz 2 hours ago

:) Good joke, but lobotomy was only introduced by Egas Moniz in 1935, more than a generation after Queen Victoria died.

▲qwertytyyuu 4 hours ago

I wonder also if it might partially be the case that it hasn't gone through any RLHF for chat. I remember that GPT-3 before RLHF wasn't much for conversation.

▲starkparker 5 hours ago

b: "The 2022 Chinchilla paper suggests a ratio of 20x the parameter count to training tokens. For a 340m model that would suggest around 7 billion tokens, more than twice the British Library corpus used here. The smallest Qwen 3.5 model is 600m parameters and that model family starts to get interesting at 2b—so my hunch is we would need 4x or more the training data to get something that starts to feel like a useful conversational partner."
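
For anyone who wants to check the arithmetic in that quote, here is a minimal Python sketch. It only applies the 20-tokens-per-parameter rule of thumb as stated in the quote; the function name `chinchilla_tokens` and the variable names are made up for illustration, and the corpus figure is just the upper bound implied by "more than twice", not a number from the article.

```python
# Rule of thumb as quoted above: about 20 training tokens per model parameter.
TOKENS_PER_PARAM = 20  # ratio taken from the quote, not independently verified here


def chinchilla_tokens(n_params: int, ratio: int = TOKENS_PER_PARAM) -> int:
    """Return the suggested training-token count for a model with n_params parameters."""
    return n_params * ratio


params = 340_000_000  # the 340m-parameter model discussed in this thread
suggested = chinchilla_tokens(params)
print(f"suggested: {suggested / 1e9:.1f}B tokens")  # 6.8B, i.e. "around 7 billion"

# "More than twice the British Library corpus" implies the corpus is
# smaller than half of the suggested token count.
corpus_upper_bound = suggested / 2
print(f"implied corpus size: under {corpus_upper_bound / 1e9:.1f}B tokens")
```

So 340m parameters times 20 gives 6.8 billion tokens, and the corpus would have to be under roughly 3.4 billion tokens for the "more than twice" claim to hold.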

▲kibibu 3 hours ago

The hard turn from this:

> Given how hard it is to train a useful LLM without using vast amounts of scraped, unlicensed data I’ve been dreaming of a model like this for a couple of years now.

To this:

> I got Claude Code to do most of the work

Gave me whiplash

▲heyethan 4 hours ago

Looks like a model size issue, but the behavior already seems largely shaped by the data distribution.

▲fastball 3 hours ago

I wonder if you could generate synthetic Victorian-era training data.

▲OJFord 2 hours ago

Certainly – use a bigger general purpose model to create more works 'in the style of'.

▲gen6acd60af 3 hours ago

> Honestly, it’s pretty terrible.

> But what a fun project!

▲voidUpdate 3 hours ago

It may be legally trained, but is it ethically trained? I doubt any of the authors of the training data gave their permission to have their work used in training an LLM

▲RugnirViking 1 hour ago

I'm reasonably sure that all of the authors are long dead (copyright is death + 70 years). Are you taking the position that they should have control over their work so long in the future? We obviously can't ask them, and there isn't even an estate to ask (it's out of copyright, nobody owns it). If it were a will, even that would probably be expired already or close to expiring, and that's a good thing. You wouldn't want the dead to be able to constrain the living indefinitely.

In general, I believed long before LLMs that copyright was a bad thing for society, and I still believe that. Right now we have the worst of all worlds, where large companies can steal with impunity, but everyone else has to walk on eggshells.

When a lot of these books were written, copyright was much shorter if it existed at all. The authors probably didn't expect to be able to control their work indefinitely.

▲voidUpdate 1 hour ago

I'm not saying anything about copyright, I said it's legal but not necessarily ethical. Copyright deals with legality. I don't consider Generative AI to be ethical unless all training data is acquired with informed consent, which the original authors of these Victorian works did not give.

▲bcjdjsndon26 minutes ago

They mean ethically as in doesn't break any copyright laws... As in the state no longer enforces the collection of rent on behalf of the rights holder because the arbitrary time limit has passed.

▲weregiraffe 3 hours ago

Do you know what public domain is?

▲voidUpdate 3 hours ago

Yes. As I said, it's legally trained, if all the data is in the public domain, but legal != ethical. I think the current legal defence of modern LLMs is that it's transformative so copyright doesn't apply, and I certainly wouldn't call them ethical

▲throawayonthe 3 hours ago

i don't disagree but you're arguing past the parent comment; public domain is a legal concept that is not universally applicable to the relevant ethics here