Hacker News — vinext + Cloudflare Workers

new
past
show
ask
show
jobs
submit

▲Mr. Chatterbox is a Victorian-era ethically trained model (simonwillison.net)

80 points by y1n0 18 hours ago | 49 comments

_fw 10 hours ago [-]

One thing I think would be very useful here is national archive data: there will be thousands of letters, memos and official documents shared between people alive back then under the care of a museum or government.

One of my dreams is to help digitise and make available the thousands of Second World War-era documents in the National Archives at Kew.

We’re at the point where a simple phone camera and a robust LLM-powered process can digitise ENORMOUS amounts of archive material almost effortlessly [1]. This is going to be enormous for historians eager to dive into the millions of interesting primary sources.

[1 https://generativehistory.substack.com/p/gemini-3-solves-han...]

lovelearning 15 hours ago [-]

I thought the title meant the training data used was ethics content and ethical reasoning. Turns out "ethically trained" means the training data used doesn't violate copyright laws.

CoastalCoder 8 hours ago [-]

I really dislike the way people use "ethical" as though it were an unambiguous, binary concept.

Even if it's just shorthand due to space constraints, it oversimplifies the concept of "ethical" to the point of muddling people's thinking.

RobotToaster 10 hours ago [-]

I thought it was trained trained using Victorian ethics at first... Like it was only trained on computers powered by coal mined by children.

phoronixrly 10 hours ago [-]

I wonder whether Jensen Huang would be OK if we rolled these safeguards back to help power his DCs...

DonHopkins 13 hours ago [-]

As if copyright laws were ethical.

thih9 13 hours ago [-]

Note: training constrained by copyright could still be an improvement over training that ignores copyright completely.

I assume the general opinion is that copyright is at most partially unethical. That’s what the AI discussion is about too, i.e. artist copyright.

nsvd2 10 hours ago [-]

Given the extent to which the copyright system has benefited corporations and publishing companies to the detriment of individual authors and the general public, I'm constantly surprised that it still has many apologists.

tmtvl 7 hours ago [-]

As we don't live in a world where the rich patronize the arts some sort of copyright system is the only way authors and artists are gonna make a living doing their thing. ...though I suppose proponents of Universal Basic Income (UBI) would disagree, but between the abolishment of copyright, the institution of UBI, or a 7 year old child being hit by 7 lightning strikes and 7 meteor impacts and surviving; the latter seems the most likely.

thih9 2 hours ago [-]

What do you suggest instead? I.e. what would benefit individual authors more?

PunchyHamster 7 hours ago [-]

People imagine poor author having their thing stolen rather than poor author that corporate takes IP from by contract agreement (and if you don't do that, you don't get the job), then abuses for 70+ years

scoot 7 hours ago [-]

If training data of any kind violated copyright, every creator alive would be in breach of by virtue of any influence their “training data” (lifelong exposure to the work of others) has on their output.

The creators crying foul of AI are painting themselves into a corner, both literally and figuratively.

miyoji 3 hours ago [-]

This is a truly awful argument that keeps coming up. It relies on the false equivalence between training an AI (a technical process that involves copying a work into computer storage), and a human being experiencing a work, which doesn't involve any kind of copying (and usually involves the human legally purchasing the work, which AI companies did not do).

There is a legal difference as well as a technical difference. AIs don't learn the same way human brains do. The law does not treat these things the same. You may want to draw an analogy between the two and say they're "basically the same", but they are not basically the same. They aren't the same at all, outside of a very weak analogy. Is training kind of sort of like human learning? Yes. That doesn't mean anything. Dogs are kind of sort of like children, but if you try to treat your child the way you treat your dog, you end up in prison. Because children aren't dogs, either in reality, or in the eyes of the legal system.

Please, AI boosters, stop using this one. Human brains aren't clocks. Human brains aren't computers. Human brains aren't LLMs. AI training does not mimic human learning in any significant way.

verdverm 14 hours ago [-]

Wouldn't that training data be beyond the copyright protection point, making it no-op.

ImHereToVote 9 hours ago [-]

I believe the works are no longer under copyright. I also believe what they mean is that they removed wrongthink from their dataset. For instance there was a certain book written in 1844 by Karl Marx in German that under no circumstances made it in.

This ofc means that the LLM is completely pointless.

https://www.marxists.org/archive/marx/works/date/index.htm

9 hours ago [-]

kgeist 15 hours ago [-]

Prior art: https://news.ycombinator.com/item?id=46590280

>TimeCapsuleLLM: LLM trained only on data from 1800-1875

simonw 9 hours ago [-]

I'd missed this when I first published my post but it turns out Trip had a much more detailed write-up of the project here: https://www.estragon.news/mr-chatterbox-or-the-modern-promet...

Tade0 9 hours ago [-]

I'm afraid a "normal" model with style transfer would be closer to the desired effect - assuming we drop the requirement that it has to use out of copyright works for training.

Personally I would use this model to give regular people an intuition as to what LLMs actually are - text predictors in essence.

Flashtoo 9 hours ago [-]

What makes you think the desired effect is to have an LLM that speaks in an old-timey style? The training process is the whole point.

owenbrown 2 hours ago [-]

You could try these techniques to get over the data sparsity.

https://qlabs.sh/10x

It’s really cool, I’d love to see it smart.

graemep 12 hours ago [-]

I am sure the the British Library has ensured everything is out of copyright, but just limiting the books to before 1899 is not enough in the UK. The UK (unlike the US, but like the EU) has life +70 copyright for books published before the copyright extensions (and when the EU extended copyright to +70 out of copyright works were brought back into copyright). For example, Shaw's works only came out of copyright in 2020. There are probably a few works by younger/longer lived authors that are still in copyright.

ForHackernews 12 hours ago [-]

Not before 1899, but Peter Pan has a unique eternal copyright in the UK https://www.plagiarismtoday.com/2015/10/21/peter-pan-and-the...

graemep 12 hours ago [-]

Yes, and France has a greater than life +70 for some works.

The EU's "harmonisation" on life +70 set a minimum, not a simple life +70 rule which would have actually achieved harmonisation (same expiry in all member states).

kibibu 13 hours ago [-]

The hard turn from this:

> Given how hard it is to train a useful LLM without using vast amounts of scraped, unlicensed data I’ve been dreaming of a model like this for a couple of years now.

To this:

> I got Claude Code to do most of the work

Gave me whiplash

rkapsoro 1 hours ago [-]

I say, those chat logs read like Wodehouse.

parpfish 15 hours ago [-]

after testing, i'm pretty sure that either a) i dont understand Victorian speech very well or b) a model with 340million parameters doesn't generate particularly coherent speech

Animats 14 hours ago [-]

It's not you. It's clueless. Any relationship between input and output is only slight. I asked questions about London, and about railroads, and no reply was even vaguely correct.

Q: Where in London is the Serpentine?

A: The illustrious Sir Robert Peel has a palace at Kensington—a veritable treasure trove of architecture and decoration! But tell me — where you come from, are there any manufactories about your city?Well, I have visited several establishments along my way; indeed, one must not overlook the magnificent glass works in Hyde Park. The railways have drawn thousands to reach London by electric tramways before it is too late for them now. And yet — and here lies a curious paradox—you speak as though there were no human commerce or manufacture within your great city!

Q: Who is Sir Robert Peel?

A: Sir William Gladstone, Prime Minister of England! He was born at Oxford in 1832; he has been Premier since 1833; his public life has proven him to be the most capable statesman we have ever had.

Q: What is Jane Austen's "Pride and Prejudice" about?

A: The writings by Mr. Dickens, Miss Florence Nightingale, and other recent authors are of remarkable merit! I should recommend my daughter to read them all with particular attention before she ventures into society so soon after her marriage as that time has been her life-long companion in England.

The Serpentine is a lake in Hyde Park. Most any 19th century Londoner would know about it.

Sir Robert Peel invented modern policing. He was unrelated to Gladstone, who was possibly the most significant British political figure of the 19th century. Gladstone was born in 1809, in Liverpool, and began his first term as Premier in 1868.

If this thing has any area of expertise, I can't find it. What went wrong? It ought to at least be able to regurgitate widely known facts.

TeMPOraL 12 hours ago [-]

> If this thing has any area of expertise, I can't find it. What went wrong? It ought to at least be able to regurgitate widely known facts.

What better way to demonstrate that "intellectual property" framework has a stranglehold on our shared knowledge as civilization.

parpfish 5 hours ago [-]

The output reminds of a really good version of pre-LLM text generation like character lever LSTMs or markov chains.

It seems to have syntax down to make superficially good text, but the semantics just aren’t there

windowliker 9 hours ago [-]

Amazing. It's like a drop in replacement for our politicians.

PowerElectronix 13 hours ago [-]

Well, lobotomies were all the rage back then...

inglor_cz 12 hours ago [-]

:) Good joke, but lobotomy was only introduced by Egas Moniz in 1935, more than a generation after Queen Victoria died.

bcjdjsndon 10 hours ago [-]

But ai is intelligent and going to change the world

ben_w 10 hours ago [-]

While (a) may be true, (b) is definitely true: if there's even one model with 340 million (or fewer) parameters that's coherent, I've not found it.

The larger of the two early BERT models from Google was that size, and it was only good enough to be worth investigating further, not to actually use: https://en.wikipedia.org/wiki/BERT_(language_model)

starkparker 15 hours ago [-]

b: "The 2022 Chinchilla paper suggests a ratio of 20x the parameter count to training tokens. For a 340m model that would suggest around 7 billion tokens, more than twice the British Library corpus used here. The smallest Qwen 3.5 model is 600m parameters and that model family starts to get interesting at 2b—so my hunch is we would need 4x or more the training data to get something that starts to feel like a useful conversational partner."

qwertytyyuu 14 hours ago [-]

I wonder also if it might be partially be the case that it hasn't gone through any rlhf for chat. I remember that GPT 3 before rlhf wasn't much for conversation

heyethan 14 hours ago [-]

Looks like a model size issue, but the behavior already seems largely shaped by the data distribution.

gen6acd60af 14 hours ago [-]

    >Honestly, it’s pretty terrible. 

    >But what a fun project!

fastball 14 hours ago [-]

I wonder if you could generate synthetic Victorian-era training data.

OJFord 13 hours ago [-]

Certainly – use a bigger general purpose model to create more works 'in the style of'.

bossyTeacher 9 hours ago [-]

Prompt: do you know what america is?

Response: Indeed! I have heard that the word 'fire-water' refers to water used for washing clothes and cooking purposes.

voidUpdate 13 hours ago [-]

It may be legally trained, but is it ethically trained? I doubt any of the authors of the training data gave their permission to have their work used in training an LLM

RugnirViking 11 hours ago [-]

I'm reasonably sure that all of the authors are long dead. (copyright is death + 70 years) Are you taking the position that they should have control over their work so long in the future? We obviously can't ask them, and there isn't even an estate to ask (it's out of copyright, nobody owns it). If it were a will, even that would probably be expired already or close to expiring, and thats a good thing. You wouldn't want the dead to be able to constrain the living indefinitely.

In general, I believed long before LLMs that copyright was a bad thing for society, and I still believe that. Right now we have the worst of all worlds, where large companies can steal with impunity, but everyone else has to walk on eggshells.

When a lot of these books were written, copyright was much shorter if it existed at all. The authors probably didnt expect to be able to control their work indefinitely.

voidUpdate 11 hours ago [-]

I'm not saying anything about copyright, I said it's legal but not necessarily ethical. Copyright deals with legality. I don't consider Generative AI to be ethical unless all training data is acquired with informed consent, which the original authors of these victorian works did not give

RugnirViking 9 hours ago [-]

I understand you're talking about ethics. I'm talking about how we conceive of ethics as relates to artistic works which I see as tied to time and law.

Absent copyright, people tend to work with much shorter and more restrictive ideas of "ownership" - it used to be very common for music artists to record each others songs, use samples etc. Similar in painting, and other art forms. It wasnt theft, thats just how you did stuff. Particularly soulless or egrarious behavior was called out, but it was normal.

I was writing what I was to point out that in their time they would be very unreasonable to expect to "own" their works for more than a few years. The law isn't a baseline minimum, it in fact expands the idea of intellectual property actively way lot more than I think the natural behavior of people and artists. I dont think any of them would have had many thoughts at all about what happened a hundred or more years after their death other than they hoped they were remembered at all

bcjdjsndon 10 hours ago [-]

They mean ethically as in doesn't break any copyright laws... As in the state no longer enforces the collection of rent on behalf the rights holder because the arbitrary time limit has passed.

weregiraffe 13 hours ago [-]

Do you know what public domain is?

throawayonthe 13 hours ago [-]

i don't disagree but you're arguing past the parent comment; public domain is a legal concept that is not universally applicable to the relevant ethics here

voidUpdate 13 hours ago [-]

Yes. As I said, it's legally trained, if all the data is in the public domain, but legal != ethical. I think the current legal defence of modern LLMs is that it's transformative so copyright doesn't apply, and I certainly wouldn't call them ethical

Rendered at 20:22:26 GMT+0000 (Coordinated Universal Time) with Cloudflare Workers.