The level of detail they had to delve into in order to understand what was happe...

Orygin · 2026-04-30T09:16:40 1777540600

It didn't seem that deep to me. They just saw an issue with goblins, cut the word out of the model, and then it appeared again in the next version without them knowing exactly how or why.

Goes to show it's all vibes when making these models. The fix is literally a prompt that says not to talk about goblins...

meken · 2026-04-30T10:42:29 1777545749

I’m not sure how that was your takeaway..?

> We retired the “Nerdy” personality in March after launching GPT‑5.4. In training, we removed the goblin-affine reward signal and filtered training data containing creature-words, making goblins less likely to over-appear or show up in inappropriate contexts. Unfortunately, GPT‑5.5 started training before we found the root cause of the goblins.

The prompt is just a short-term hotfix/hack because they couldn't land the proper fix in time.

Orygin · 2026-04-30T12:27:34 1777552054

Then maybe stop training and make a real fix?

If you need to put baby guardrails on your model because the training is effed up, maybe you should rethink how you make these models and how much control you really have over them.

luke-stanley · 2026-04-30T13:04:00 1777554240

It's a funny detail to skim, but what's more surprising is how mechanistic interpretability and alignment science have much better tools and research than the goblin blog post suggests, including from OpenAI's own alignment team:

https://alignment.openai.com/argo/ (finding what the reward models are actually encouraging)

https://alignment.openai.com/sae-latent-attribution/ (which model features drive specific behaviours; presumably this would be great for goblin hunts)

https://alignment.openai.com/helpful-assistant-features/ (how a high-level misaligned personality shows up when fine-tuning on bad advice)

It's weird that the goblin post doesn't seem to draw upon these tools.

Anthropic's recent emotions paper shows how broad the functional emotions are, even finding specific emotions firing before cheating (!): https://transformer-circuits.pub/2026/emotions/index.html

I hope their alignment researchers aren't too annoyed by the goblin post; it seems oddly siloed!

alansaber · 2026-04-30T09:00:45 1777539645

This is a little bit too whimsical for me, but distributed model training across thousands of GPUs has the potential to introduce lots of little quirks that are impossible to trace to an exact source.

Razengan · 2026-04-30T07:25:27 1777533927

> The Quanta article referenced at [1] used the term "Anthropologist of Artificial Intelligence"

I propose "Goblin Hunter"

(if ever goblins turn out to be an actual species, I apologize for this prebigotry)

gizajob · 2026-04-30T08:10:30 1777536630

AI Goblinologist.