Our commitment to community safety
Key Points
- Automated and human review for risk detection
- Zero‑tolerance enforcement and escalation
- Parental controls and crisis referrals
Summary
OpenAI describes how ChatGPT is trained and operated to reduce the risk of real‑world violence and harm. The approach combines model‑level refusals, automated risk detection across conversations, contextual human review, escalation protocols (including notifying law enforcement when an imminent risk is detected), enforcement actions (account bans and access revocation), and support and referrals for users in crisis. The post also highlights parental controls, an upcoming trusted‑contact feature, and ongoing model improvements guided by external experts.
Key Points (detailed)
- Model behavior: models are trained to be helpful while refusing requests that could meaningfully enable violence; factual and educational discussion is allowed, but operational details are omitted.
- Detection stack: automated classifiers, reasoning models, hash‑matching, blocklists and behavior signals analyze content and patterns across long conversations.
- Human review: flagged content is assessed in context by trained reviewers with limited, secure access to user data; reviewers apply structured criteria to determine risk and next steps.
- Enforcement: zero tolerance for using the tools to assist violence, meaning immediate account disablement, cross‑account blocking, and steps to detect new accounts; users may appeal decisions.
- Escalation: higher‑risk cases undergo deeper investigation; imminent, credible threats trigger law enforcement notification; mental‑health experts inform referral criteria.
- User support features: localized crisis resources surfaced in conversations, parental controls linking parent/teen accounts (no parental access to messages), and a forthcoming trusted‑contact option for adults.
- Engineering implications: combine signal types, prioritize longitudinal context, enforce strong privacy controls for reviewers, maintain robust escalation workflows, and iterate with expert feedback.
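To make the engineering implications above concrete, here is a minimal sketch of how per-message signals might be combined into a conversation-level score and routed to human review and escalation. It is an illustration only, not OpenAI's implementation: the class names, field names, the decay factor, and the thresholds are all hypothetical, and the post does not describe how its signals are weighted.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "none"
    HUMAN_REVIEW = "human_review"
    DISABLE_ACCOUNT = "disable_account"
    NOTIFY_LAW_ENFORCEMENT = "notify_law_enforcement"


@dataclass
class TurnSignals:
    """Signals for one message. All fields are hypothetical."""
    classifier_score: float      # assumed classifier output in [0, 1]
    blocklist_hit: bool = False
    hash_match: bool = False


def conversation_risk(turns, decay=0.8):
    """Combine signal types across a whole conversation into one score.

    A hard match (blocklist or hash) floors the turn score at 0.9.
    Earlier turns decay but keep contributing, so a risky message is
    not forgotten after a few benign ones (longitudinal context).
    """
    risk = 0.0
    for t in turns:
        turn_score = t.classifier_score
        if t.blocklist_hit or t.hash_match:
            turn_score = max(turn_score, 0.9)
        risk = max(turn_score, decay * risk + (1 - decay) * turn_score)
    return risk


def route(risk, review_threshold=0.5):
    """Automated stage: only decides whether a human should look."""
    return Action.HUMAN_REVIEW if risk >= review_threshold else Action.NONE


@dataclass
class ReviewFinding:
    """Outcome recorded by a trained reviewer (hypothetical criteria)."""
    violates_policy: bool
    credible: bool = False
    imminent: bool = False


def escalate(finding):
    """Human stage: map structured review criteria to a next step."""
    if finding.violates_policy and finding.credible and finding.imminent:
        return Action.NOTIFY_LAW_ENFORCEMENT
    if finding.violates_policy:
        return Action.DISABLE_ACCOUNT
    return Action.NONE


if __name__ == "__main__":
    turns = [TurnSignals(0.9), TurnSignals(0.1), TurnSignals(0.1)]
    risk = conversation_risk(turns)
    print(route(risk))
    print(escalate(ReviewFinding(violates_policy=True)))
```

The split into `route` and `escalate` mirrors the separation the post describes: automation flags content, while consequential decisions such as law enforcement notification depend on a reviewer's structured assessment. A real system would also need access controls around reviewer data and an appeal path, which this sketch leaves out.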
For more detail, refer to the Model Spec and the Usage Policies linked in the original post.