Prompt-based teen safety policies for gpt-oss-safeguard
Key Points
- Prompt-based teen safety policies released
- Designed for gpt-oss-safeguard and open-weight models
- Published open source for adaptation and collaboration
Summary
OpenAI released a set of prompt-formatted teen safety policies designed to work with the open-weight safety model gpt-oss-safeguard. The policies translate teen-specific risks into operational prompts that can be used as classifiers for real-time content filtering or offline analysis. They were developed with input from external experts (Common Sense Media, everyone.ai) and are published open source via the ROOST Model Community to encourage adaptation and collaboration.
Key Points
- Contents covered: graphic violent content, graphic sexual content, harmful body ideals/behaviors, dangerous activities/challenges, romantic/violent roleplay, and age-restricted goods/services.
- Format & usage: policies are delivered as prompts that can be run with gpt-oss-safeguard and other reasoning models for consistent classification; suitable for both real-time filtering and batch analysis.
- Practical guidance: adapt policies to your product context, combine with product-level safeguards (age checks, parental controls, monitoring, teen-friendly transparency), and iterate on thresholds and edge cases.
- Collaboration and extensibility: released open source on the ROOST Model Community/GitHub; community contributions and localization encouraged.
- Not a complete solution: treat these as a starting point—implement layered defenses and evaluate behavior in your specific UX and user population.
- Getting started: download gpt-oss-safeguard from Hugging Face, integrate prompt policies as classifiers, test across teen-focused scenarios, and log/monitor performance and false positives.
Implementation notes
- Use the prompt templates as rule-defining classifiers, not as a single safety gate; combine outputs with deterministic checks and product rules.
- Run A/B or staged rollouts to measure false positives/negatives for teen audiences and adjust thresholds.
- Maintain a feedback loop with moderation and user reporting to refine prompts and broaden covered risk areas.