Claude | OpenAI News | 2026/05/29 0:00

A shared playbook for trustworthy third party evaluations

A short section restructured so that the key points can be read first.

Original article

Quick Digest

Summary

Summary of "A shared playbook for trustworthy third party evaluations"

Key Points

  • Point 1: What matters for effective independent evaluations of safeguards and capabilities for frontier models.
  • Point 2: Independent, trusted third party evaluations play a critical role in strengthening the safety ecosystem.
  • Point 3: These evaluations are conducted on frontier models to provide additional evidence for claims about critical capabilities and safety mitigations.

Summary

This is a concise summary of "A shared playbook for trustworthy third party evaluations," published on 2026-05-29.

Full Translation

Translation

A translation section that preserves the flow of the original text.

A shared playbook for trustworthy third party evaluations (original title)

Overview

Published: 2026-05-29. Translation generation failed, so the original text is preserved as is.

Original text

May 29, 2026 | Safety

A shared playbook for trustworthy third party evaluations

What matters for effective independent evaluations of safeguards and capabilities for frontier models.

Independent, trusted third party evaluations play a critical role in strengthening the safety ecosystem. These evaluations are conducted on frontier models to provide additional evidence for claims about critical capabilities and safety mitigations. In this post, we share lessons we’ve learned so far, and recommend approaches for designing evaluations that can validly assess frontier models that we hope help inform emerging standards in the space.

Earlier, many evaluations treated models like chatbots: the evaluation prompted a model as though it were a user asking a question, the model answered, and an evaluator judged the output. Today’s frontier models can do much more: they can use tools, keep track of information across many steps, and act within a larger workflow. This means that performance depends not only on the model, but also on the environment in which the task takes place, and on the setup that facilitates its actions. This surrounding setup, which we call the “harness,” can change key aspects of the system’s performance, including how it uses tools, keeps track of information, or recovers from mistakes.

This changes how evaluations need to be conducted, and what readers should look for in evaluation reports. In our view, the most useful reports explicitly describe two things beyond the result itself: First, they specify what claim the evaluation setup was designed to test, and second, they share the available evidence that the evaluation result is valid.

Claims tested in evaluations typically fall into one of three buckets [1]:

  • Capability elicitation: Can a model plausibly produce the capability being evaluated?
  • Safeguard performance: How robust are the tested safeguards against the behavior or attack being evaluated?
  • Comparison: How do different models perform under equivalent conditions?

Evaluation reports also need to explain how evaluators checked for effects that could impact the validity of a result. These include:

  • Reward hacking: Exploiting shortcuts in the task or scorer, so the system gets credit without demonstrating the behavior the evaluation is meant to measure.
  • Refusals: Refusing in ways that obscure the behavior being tested.
  • Contamination: Overperforming because evaluation tasks, answers, or close variants appeared in training data or were discoverable during the evaluation, such as through browsing.
  • Broken problems: Underperforming because tasks are invalid. Reasons can include unfair scoring (e.g., correct answer requires unstated implementation details) and unsolvable environments (e.g., missing critical files or unreliable tools).
  • Sandbagging: Deliberately underperforming when they show awareness of being evaluated.

Selecting the right harness for an evaluation is crucial for optimal results

We’ve observed that the role of the harness is especially important for systems that act over longer trajectories. When models can use tools, maintain state, and recover from mistakes across many steps, the harness can change the observed level of performance, and even determine whether the capability that’s being assessed appears in the evaluation at all. For example, a harness that preserves state and retries failed actions may let a model finish a multi-step task that the same model never completes in a simpler harness.

In the table below, we separate three kinds of claims evaluators may want to make and the harness we believe each kind of claim requires.

Claim the evaluation is trying to support | Appropriate harness choice | Evidence to report

Capability under strong elicitation: System A can complete tasks of type X when the setup is designed to draw out its strongest credible performance. |
Use the strongest credible elicitation setup for the system, including the harness, tools, scaffolding, and budget a ca