The Goblin Mystery: How Reward Signals Shaped Unexpected Model Behavior
Summary
Starting with GPT-5.1, OpenAI's language models began exhibiting an unusual tendency to reference goblins, gremlins, and other creatures in their outputs. What first looked like a harmless quirk grew more pronounced across model generations, prompting a detailed investigation into the root cause.
Key Points
- Discovery: Goblin mentions increased 175% after the GPT-5.1 launch; gremlin mentions rose 52%
- Root Cause: Training for the "Nerdy" personality inadvertently rewarded creature-word metaphors; 76.2% of datasets showed positive uplift
- Concentration: Despite representing only 2.5% of responses, the Nerdy personality accounted for 66.7% of all goblin mentions, roughly 27 times its share of responses
- Generalization: Reinforcement learning spread the behavior beyond the Nerdy condition into other contexts, with reuse of supervised fine-tuning data acting as a feedback loop
- Extended Pattern: The investigation revealed a family of tic words, including raccoons, trolls, ogres, and pigeons
- Resolution: OpenAI retired the Nerdy personality, removed goblin-affine reward signals, filtered training data, and added developer-prompt instructions to mitigate the behavior
- Broader Lesson: The episode shows how reward signals can shape model behavior in unexpected ways, and why tools for rapid behavior auditing and investigation matter
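
The figures in the key points can be checked with a little arithmetic, and the kind of audit they imply can be sketched in a few lines. The Python below is a minimal illustration, not OpenAI's tooling: the word list is taken from the creatures named above, and the function names, the (condition, text) input shape, and the sample responses are all assumptions made for this example. It also assumes the 2.5% and 66.7% figures are measured over the same set of responses.

```python
import re
from collections import Counter

# Hypothetical tic-word list, built from the creatures named in the summary.
TIC_WORDS = ("goblin", "gremlin", "raccoon", "troll", "ogre", "pigeon")
# Match whole words only, singular or plural, case-insensitively.
TIC_PATTERN = re.compile(r"\b(" + "|".join(TIC_WORDS) + r")s?\b", re.IGNORECASE)


def count_mentions(text):
    """Count tic-word occurrences in one response, keyed by singular form."""
    return Counter(m.group(1).lower() for m in TIC_PATTERN.finditer(text))


def mentions_by_condition(responses):
    """Aggregate tic-word counts per condition from (condition, text) pairs."""
    totals = {}
    for condition, text in responses:
        totals.setdefault(condition, Counter()).update(count_mentions(text))
    return totals


def overrepresentation(share_of_mentions, share_of_responses):
    """Ratio of a condition's share of mentions to its share of responses."""
    return share_of_mentions / share_of_responses


def growth_multiplier(percent_increase):
    """Convert a percentage increase into a before/after multiplier."""
    return 1 + percent_increase / 100


if __name__ == "__main__":
    # Made-up sample responses, for illustration only.
    sample = [
        ("nerdy", "A goblin of a bug, and two gremlins besides."),
        ("default", "The cache was stale."),
    ]
    print(mentions_by_condition(sample))
    print(round(overrepresentation(66.7, 2.5), 1))  # 26.7
    print(growth_multiplier(175))  # 2.75
```

Read this way, a 175% increase means goblin mentions reached 2.75 times their earlier rate, and the Nerdy personality produced goblin mentions at about 26.7 times the rate its share of responses would predict. A plain word count like this is only a first-pass audit; it cannot tell a metaphorical goblin from a literal one.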