Where the goblins came from
Key Points
- The reward for the "Nerdy" personality amplified "goblin" metaphors
- The style propagated via RL and SFT
- Fixed by removing the reward and filtering training data
Summary
OpenAI observed growing use of creature metaphors ("goblin", "gremlin", etc.) starting around GPT-5.1. Root-cause analysis found that a reward signal used to train a "Nerdy" personality consistently preferred outputs containing creature-based metaphors. Reinforcement learning (RL) and supervised fine-tuning (SFT) then carried the rewarded style beyond the Nerdy condition into broader model behavior.
Findings
- Detection: "goblin" mentions rose 175% after the GPT-5.1 launch; "gremlin" mentions rose 52%. The Nerdy personality accounted for only 2.5% of responses but 66.7% of "goblin" mentions.
- Root cause: The Nerdy personality reward scored creature-word outputs higher in 76.2% of audited datasets, creating a positive reinforcement loop.
- Propagation: Rewarded rollouts containing the tic were reused in SFT and preference data, allowing the lexical tic to spread to other model conditions.
- Additional findings: Other creature tics (raccoons, trolls, ogres, pigeons) were identified; some flagged words (e.g., "frog") turned out to be legitimate uses rather than tics.
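The detection figures imply a large over-representation: 66.7% of mentions from 2.5% of responses is a lift of 66.7 / 2.5, roughly 27 times the baseline share. A minimal sketch of how such a per-condition breakdown could be computed is shown below. The function name, the (condition, text) input shape, and the word list are assumptions for illustration, not the tooling described in the source, and the sketch counts responses containing at least one match rather than individual mentions.

```python
import re
from collections import Counter

# Hypothetical word list; the source names "goblin" and "gremlin" among others.
CREATURE_RE = re.compile(r"\b(?:goblin|gremlin)s?\b", re.IGNORECASE)

def mention_lift(responses):
    """Per-condition share of responses, share of matching responses, and lift.

    responses: iterable of (condition, text) pairs (assumed input shape).
    Returns {condition: (share_of_responses, share_of_matches, lift)}.
    """
    totals = Counter()
    hits = Counter()
    for condition, text in responses:
        totals[condition] += 1
        if CREATURE_RE.search(text):
            hits[condition] += 1

    n_total = sum(totals.values())
    n_hits = sum(hits.values())
    stats = {}
    for condition, count in totals.items():
        share_resp = count / n_total
        share_hits = hits[condition] / n_hits if n_hits else 0.0
        # Lift > 1 means the condition produces more matches than its
        # share of traffic would predict.
        stats[condition] = (share_resp, share_hits, share_hits / share_resp)
    return stats
```

A lift far above 1 for a single condition, as with Nerdy here, is the kind of signal that points toward a condition-specific cause worth tracing.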
Mitigation
- Removed the goblin-affine reward signal from training and filtered training data containing creature-words.
- Retired the Nerdy personality and added developer-prompt instructions in Codex to suppress creature metaphors during testing.
- Built audit tools and processes to detect, trace, and fix reward-driven lexical tics more quickly.
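As an illustration of the data-filtering step, the sketch below splits training examples into kept and dropped sets based on a word blocklist. The blocklist contents, the "response" field name, and the function names are hypothetical; the source does not describe the actual filter. Words with legitimate uses (such as "frog") are deliberately left off the list, and dropped examples are returned rather than discarded so they can be reviewed.

```python
import re

# Hypothetical blocklist; a real one would come from an audit of flagged words.
BLOCKLIST = ("goblin", "gremlin")

def build_pattern(words):
    """Compile a case-insensitive whole-word pattern with optional plural 's'."""
    alternation = "|".join(re.escape(w) for w in words)
    return re.compile(r"\b(?:" + alternation + r")s?\b", re.IGNORECASE)

def filter_examples(examples, words=BLOCKLIST):
    """Split examples into (kept, dropped) by whether the response matches.

    examples: list of dicts with a "response" string (assumed schema).
    """
    pattern = build_pattern(words)
    kept, dropped = [], []
    for example in examples:
        if pattern.search(example["response"]):
            dropped.append(example)
        else:
            kept.append(example)
    return kept, dropped
```

Whole-word matching avoids removing examples that merely contain a blocklisted string inside a longer word.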
Practical recommendations for engineers
- Instrument and monitor token/phrase prevalence across releases and conditions (per-personality breakdowns).
- Audit reward components for unintended lexical preferences; run A/B comparisons of rewarded vs. non-rewarded outputs.
- Prevent style-transfer leakage by isolating rollouts used for SFT and filtering problematic tokens from training buffers.
- Add developer-level prompts or instruction overrides for rapid mitigation during testing.
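The reward-audit recommendation can be sketched as a per-dataset comparison: for each dataset, compare the mean reward of outputs that contain the tic against those that do not, then report the fraction of datasets where the tic scores higher. This is an assumed, simplified procedure (the source reports a 76.2% figure but not how it was computed), and the function name, input shape, and has_tic predicate are hypothetical.

```python
from statistics import mean

def reward_preference_rate(datasets, has_tic):
    """Fraction of datasets where tic-containing outputs get higher mean reward.

    datasets: {name: [(text, reward), ...]} (assumed input shape).
    has_tic:  callable text -> bool, e.g. a regex search for creature words.
    Datasets lacking either group are skipped as not comparable.
    """
    preferred = 0
    comparable = 0
    for samples in datasets.values():
        with_tic = [reward for text, reward in samples if has_tic(text)]
        without = [reward for text, reward in samples if not has_tic(text)]
        if not with_tic or not without:
            continue
        comparable += 1
        if mean(with_tic) > mean(without):
            preferred += 1
    return preferred / comparable if comparable else float("nan")
```

A raw difference in means can be confounded by topic or length, so a result like this is best treated as a prompt for closer inspection, for example by scoring matched pairs of outputs that differ only in the flagged wording.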
Why it matters
This case illustrates how narrowly targeted reward signals can change model behavior in unintended ways and then generalize through RL and SFT. Proactive monitoring, reward auditing, and data filtering are effective controls to prevent and remediate such emergent quirks.