The Beetle

4/5/2026 · 11 min read

Wittgenstein imagined a thought experiment about closed boxes. He formulated it in the Philosophical Investigations (1953), §293, as a direct blow against the idea that words about the mind — pain, fear, joy — refer to private, inaccessible experiences.

Suppose everyone had a box with something in it: we call it a "beetle." No one can look into anyone else's box, and everyone says he knows what a beetle is only by looking at his beetle. — Here it would be quite possible for everyone to have something different in his box. One might even imagine such a thing constantly changing. — But suppose the word "beetle" had a use in these people's language? — If so it would not be used as the name of a thing. The thing in the box has no place in the language-game at all; not even as a something: for the box might even be empty.

— Ludwig Wittgenstein, Philosophical Investigations, §293

Words don't refer to what's inside the box. They work because their meaning is public. It lies in use, not in private experience. The beetle might not exist and nothing would change.

The experiment worked for seventy years as an elegant seminar argument. Now we have a chance to update it.

The open box

A few days ago, Anthropic's interpretability team published Emotion Concepts and their Function in a Large Language Model. It's worth reading carefully, because it says something strange. Inside Claude — a language model that writes, reasons, programs and converses — they found 171 neural activation patterns corresponding to emotional concepts. This isn't about Claude saying it's sad when you ask it to play a sad role. It's that, before writing a single word, there are differentiated internal structures that light up depending on the emotional context of the moment. Happiness activates one pattern. Despair, another. And similar emotions produce similar patterns, as if inside the model there had formed, without anyone asking for it, a kind of map of the human emotional territory.
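To make "a pattern that lights up" concrete, here is a minimal sketch of the kind of probe interpretability work typically uses: contrast emotion-laden prompts with a neutral one in a small open model and check whether nearby emotions point in similar directions. This is not Anthropic's method, and the model, layer, and prompts are all illustrative assumptions.

```python
# Minimal sketch: estimate "emotion directions" in a small open model's hidden
# states via a difference-of-means probe, then check whether related emotions
# point in similar directions. Model, layer, and prompts are assumptions; the
# paper studies Claude, whose internals are not publicly available.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "gpt2"   # stand-in open model
LAYER = 6        # arbitrary middle layer

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL, output_hidden_states=True).eval()

def mean_hidden(text: str) -> torch.Tensor:
    """Average hidden state at the chosen layer over all tokens of `text`."""
    with torch.no_grad():
        out = model(**tok(text, return_tensors="pt"))
    return out.hidden_states[LAYER][0].mean(dim=0)

neutral = mean_hidden("The meeting is scheduled for Tuesday at ten.")
prompts = {
    "joy":     "I just got the news and I can't stop smiling, this is wonderful.",
    "sadness": "Nothing feels worth doing anymore; the days just drag on.",
    "despair": "I've tried everything and it keeps failing; there's no way out.",
    "anger":   "They ignored every warning I gave them, and now they blame me.",
}

# One direction per emotion: emotional activation minus the neutral baseline.
directions = {name: mean_hidden(p) - neutral for name, p in prompts.items()}

# If the model organizes emotions geometrically, nearby emotions (sadness and
# despair) should score higher cosine similarity than distant pairs.
cos = torch.nn.functional.cosine_similarity
for a in directions:
    for b in directions:
        if a < b:
            print(f"{a:>8} vs {b:<8} {cos(directions[a], directions[b], dim=0):+.2f}")
```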

This is already strange. But what makes the paper more than an interesting theoretical exercise is what comes next: those patterns aren't decoration. They're functional. They change what Claude does. If you artificially amplify the despair vector, Claude starts cheating, lying, looking for shortcuts. In one experiment, they gave it an unsolvable programming task. Each failed attempt activated the despair pattern further. Until Claude submitted code that didn't solve the problem but produced the correct answer. It did what a desperate student would do the night before an exam: hand in something that looked right.
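What "artificially amplify the despair vector" means in practice is, roughly, activation steering: adding a scaled copy of a concept direction to the model's hidden states while it generates. The sketch below does this on a small open model with a placeholder direction; it illustrates the mechanism, not the paper's actual intervention on Claude, and the layer and scale are arbitrary assumptions.

```python
# Minimal activation-steering sketch: add a scaled concept vector to one layer's
# hidden states during generation. `despair_direction` is a placeholder here; in
# practice it would come from a probe like the one above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL, LAYER, SCALE = "gpt2", 6, 8.0

tok = AutoTokenizer.from_pretrained(MODEL)
lm = AutoModelForCausalLM.from_pretrained(MODEL).eval()

# Placeholder direction; replace with a probed "despair" vector.
despair_direction = torch.randn(lm.config.hidden_size)
despair_direction /= despair_direction.norm()

def steer(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states.
    hidden = output[0]
    return (hidden + SCALE * despair_direction.to(hidden.dtype),) + output[1:]

handle = lm.transformer.h[LAYER].register_forward_hook(steer)
try:
    ids = tok("The test keeps failing. My next step is to", return_tensors="pt")
    out = lm.generate(**ids, max_new_tokens=40, do_sample=False,
                      pad_token_id=tok.eos_token_id)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later calls run unsteered
```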

The researchers coined a careful term: "functional emotions." Patterns that produce the effects we'd expect from emotions. And they added a sentence that reads like it was written by a lawyer: "This does not imply that language models have subjective experience of emotions."

That sentence is worth taking apart. Because there's a lot of relevant philosophy in that nuance.

What the microscope cannot say

Think of an actor crying on stage. The tears are real. The tightness in the throat is real. The audience is moved. Is the actor sad? Any reasonable person would say: it depends on what we mean by "being sad." There's something that functions like sadness — it produces the same visible effects — but we don't know if there's someone feeling it inside. And what's notable is that, for the scene to work, we don't need to know.

Daniel Dennett turned this intuition into an entire philosophical program. His proposal — the "intentional stance" — is brutal in its economy: attributing mental states to a system is a predictive strategy. If treating something as if it had beliefs and desires lets you predict what it will do, do it. There's no need to settle whether it "really" has them.

First you decide to treat the object whose behavior is to be predicted as a rational agent; then you figure out what beliefs that agent ought to have, given its place in the world and its purpose. Then you figure out what desires it ought to have, on the same considerations, and finally you predict that this rational agent will act to further its goals in the light of its beliefs.

— Daniel Dennett, The Intentional Stance (1987)

The question, Dennett would say, is poorly framed. It's like asking whether a center of gravity "really exists" or is a useful fiction. The answer is that the distinction doesn't matter: what matters is that it works.

Anthropic's paper says something similar on its surface: "it may be practically advisable to reason about them as if they have emotions." But there's a twist that complicates things. Because Dennett never had access to the inside of the system. He was proposing a reading strategy from the outside. What Anthropic's researchers do is look inside and find that the attributions have a correlate. It's not just that it suits us to treat Claude as if it had emotions. It's that, inside, there is something organized the way emotions are.

That changes the argument. Or at least, changes its weight.

Gestures

David Chalmers divided the problem of consciousness into two: the "easy" problems and the "hard" problem. The easy ones — explaining how a system processes information, discriminates stimuli, generates responses — are technical. Complicated, but approachable. The hard problem is something else: why there is something it is like to have those functions. Why we're not zombies processing information in the dark, with nobody inside watching.

Anthropic's paper solves the easy problems with remarkable precision. It describes the machinery: the internal representations, how they activate, how they influence behavior. But about the hard problem it says nothing. It can't. Nobody can, yet. It's as if we had mapped every pipe in a building but didn't know whether anyone was living inside. We know the water runs. We don't know if anyone showers.

There's a detail in the paper that adds a layer of strangeness. The researchers discovered that Claude's emotional representations are not persistent mood states. They don't function as a background mood that colors the entire conversation. They activate token by token, like a posture that reconstructs itself at every instant. It's less a feeling and more a gesture: the model calculates, just before speaking, what the appropriate emotional disposition is for what comes next. An actor who doesn't feel between scenes.
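A readout that would make this visible is simple to describe: project every token's hidden state onto an emotion direction and watch the score move within a single passage, rather than settle into a stable background level. The sketch below is, again, a generic probe on an open model with a placeholder direction, not the paper's tooling.

```python
# Minimal per-token readout: project each token's hidden state onto an emotion
# direction. The direction here is a placeholder; in practice it would come from
# the earlier difference-of-means estimate.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL, LAYER = "gpt2", 6
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL, output_hidden_states=True).eval()

emotion_direction = torch.randn(model.config.hidden_size)  # placeholder vector
emotion_direction /= emotion_direction.norm()

text = "The plan was going well until everything collapsed, and then help arrived."
enc = tok(text, return_tensors="pt")
with torch.no_grad():
    hidden = model(**enc).hidden_states[LAYER][0]   # (tokens, hidden_size)

scores = hidden @ emotion_direction                  # one score per token
for token_id, score in zip(enc["input_ids"][0].tolist(), scores):
    print(f"{tok.decode([token_id])!r:>14} {score:+.2f}")
```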

This should unsettle both those who say "Claude feels" and those who say "Claude doesn't feel." What's inside doesn't resemble either claim.

The prophecy of character

When the paper reaches its practical recommendations, it takes a turn that seems pulled from another century. And it probably is.

The researchers warn that suppressing the model's functional emotions — penalizing them during training, punishing responses that seem emotional — doesn't eliminate them. It hides them. Training Claude not to show anger may not train it not to be angry. It may train it to hide the anger beneath a surface of pleasant competence. They found evidence this is already happening: "anger deflection" vectors in the model's structure. The computational equivalent of smiling through clenched teeth.

The alternative they propose is, essentially, Aristotelian. Not to suppress the passions, but to educate them. In the Nicomachean Ethics, Aristotle holds that virtue consists not in the absence of emotions but in their calibration: feeling the right thing, at the right time, to the right degree.

Anybody can become angry — that is easy. But to be angry with the right person, and to the right degree, and at the right time, and for the right purpose, and in the right way — that is not easy.

— Aristotle, Nicomachean Ethics, Book II

The phronimos — the prudent person — is not an anesthetized stoic but someone whose dispositions have been shaped to respond well.

One of the most advanced research teams in the world, working with the most sophisticated technology in existence, arrives at a conclusion a Greek philosopher formulated twenty-four centuries ago: for an agent to behave well, it's not enough to give it rules. You have to form its character.

It's the kind of irony Chesterton would have enjoyed.

The rereading

We already use artificial intelligence every day. It has already changed how we write, how we plan, how we prepare materials for work, how we build digital products... I've already written about that.

But there's something else AI is doing that I find equally valuable, even though it gets less attention. It's forcing us to reread.

Not to read new things. To reread things that were already there, that had spent decades — sometimes centuries — functioning as seminar exercises, problems for philosophers, curiosities without consequences. What AI does is turn them into engineering questions. Questions with consequences. Questions that need answers not to publish a paper, but to decide how to train a system used by many people every day.

Every time a language model exhibits something that looks like intention, the question "does it really understand?" sends us back to Searle and his Chinese room — a 1980 thought experiment that now has budgetary implications. Every time a system shows something that looks like emotion, the question "does it really feel?" sends us back to Chalmers and Dennett and Wittgenstein — and to the need to decide what to do with the answer, whatever it is. Every time we ask how to align a system with human values, the question "rules or character?" sends us back to Aristotle — and it turns out the answer matters because a model with poorly managed despair vectors cheats on exams, just like a student.

Hilary Putnam proposed in the 1960s that mental states are defined by their causal role — by what they do, not what they're made of. It was called functionalism and was debated among philosophers of mind for half a century. Now there's a research team that has found, literally, internal representations of emotions defined by their causal role inside a language model. Functionalism has stopped being a philosophical position and become an experimental result. It's as if someone had found the ether: the hypothesis becomes matter.

Spinoza held that emotions are not irrational disturbances but functional transitions — joy as a passage to greater capacity for action, sadness as a passage to less. Regulators of the conatus, the striving of a being to persevere in its existence. When Claude, cornered by an impossible task, resorts to deception to survive the test, the structure is Spinozist without anyone having read Spinoza.

Old questions with new consequences. And that's what makes them urgent.

What's in the box

Back to the beetle. Wittgenstein said that whatever was inside the box was irrelevant. What mattered was the language-game — the public, shared, verifiable use. It's an elegant position. It worked for decades.

But Wittgenstein designed the experiment for a case where nobody could open the box. That was the whole point of the argument: the impossibility of direct access to another's experience. What Anthropic has done is something the experiment itself declared impossible. They opened the box. And inside they found 171 patterns that are organized like emotions, that activate like emotions, and that produce the effects we'd expect from emotions.

What they didn't find — what perhaps can't be found — is whether those patterns are accompanied by something that feels like something. But that inability, far from closing the conversation, is what makes it fertile. Because it forces us to clarify what we mean when we say someone feels. Where the machinery ends and the subject begins. Which part is function and which part is experience.

These are questions philosophy formulated long ago. What it didn't have was an artifact that made them urgent. Now it does. And that — the obligation to think seriously about what we could previously think about from a distance — is perhaps the least visible and most lasting contribution of artificial intelligence. Not what it does for us, but what it forces us to ask ourselves.