How jailbreaking models work:
People talk about “capabilities” and “jailbreaks” as if they’re fundamentally different things.
I increasingly think they’re the same phenomenon.
A large model is basically a giant machine for creating connections between concepts.
Bigger model?
- More dimensions to represent ideas.
- More layers to reason over them.
That’s it.
The magic happens when you can get half the model to light up at once.
That’s where you get the “holy shit, it connected those two things?” moments.
The funny part is that jailbreaks exploit the exact same mechanism.
You’re not usually beating the model with some clever string.
You’re trying to activate a ridiculous number of concepts, contexts, analogies, roleplays, abstractions, edge cases, and reasoning paths simultaneously.
Basically constructing the world’s most autistic treasure hunt.
A sufficiently capable model starts exploring all of it.
Most paths die.
One survives.
Congratulations, you’ve found the path that outputs the thing you wanted.
The uncomfortable question for alignment is:
At what point does a jailbreak just become a capability benchmark with different incentives?
Because from where I’m sitting, “creative reasoning” and “creative rule circumvention” appear to be powered by the same underlying machinery.