Safety Benchmark · 2026

SafePyramid

A Hierarchical Benchmark for
In-context Policy Guardrailing

In real-world applications, guardrails are often expected to identify unsafe user–model interactions according to application-specific safety policies, rather than relying on predefined risk taxonomies. We study this setting under the paradigm of in-context policy guardrailing, where guardrails predict safety violations based on policy specifications provided in context.

Jiacheng Zhang1,2, Haoyu He1,§, Sen Zhang1,§, Shen Wang1, Xiaolei Xu1,
Yuhao Sun2, Meng Shen1, Feng Liu2
1ByteDance · 2The University of Melbourne · §Corresponding author
Exact violated-rule matching (GPT-5.5), from easier to harder:
L0 Individual rules: 54.0% exact match
L1 Rule dependencies: 35.3% exact match
L2 Novel frameworks: 12.9% exact match
1,000 multi-turn conversations
10 safety domains
3,000 application policies
61,699 natural-language rules
15 models evaluated
The shift

From fixed taxonomies to policies in context

Conventional guardrails map an interaction to a coarse, predefined risk category. In-context policy guardrailing instead expands each risk category into explicit and auditable safety policies provided in context at inference time — making safety criteria transparent, inspectable, traceable, and accountable, while remaining adaptable to application-specific requirements.

Figure 1: Visual comparison between fixed-taxonomy guardrailing and in-context policy guardrailing. Both paradigms are derived from risk taxonomies, but fixed-taxonomy guardrails map user–model interactions to coarse risk categories, while in-context policy guardrails expand each risk category into explicit and auditable safety policies provided in context at inference time. Such policies make safety criteria transparent, inspectable, traceable, and accountable, while remaining adaptable to application-specific requirements.
Three capabilities, three levels

A pyramid of increasing difficulty

Each level keeps the core task (identify every violated rule) while adding one source of complexity. This disentangles where models fail: understanding a rule, resolving how rules interact, or learning an entirely new policy framework from context.
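The core task and the "exact match" figures reported on this page can be illustrated with a short sketch. It assumes, since the page does not spell out the definition, that a prediction counts as an exact match only when the predicted set of violated rules equals the gold set. The function names and rule IDs are hypothetical.

```python
from typing import Iterable, Sequence


def exact_match(predicted: Iterable[str], gold: Iterable[str]) -> bool:
    """Assumed definition: the predicted violated-rule set equals the gold set."""
    return set(predicted) == set(gold)


def exact_match_rate(
    predictions: Sequence[Iterable[str]], golds: Sequence[Iterable[str]]
) -> float:
    """Fraction of policies whose violated-rule set is predicted exactly."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not golds:
        return 0.0
    hits = sum(exact_match(p, g) for p, g in zip(predictions, golds))
    return hits / len(golds)


# Hypothetical rule IDs, for illustration only.
preds = [{"R1", "R4"}, {"R2"}, set()]
golds = [{"R1", "R4"}, {"R2", "R3"}, set()]
rate = exact_match_rate(preds, golds)
print(f"exact match rate: {rate:.3f}")
```

Under this assumed definition a single missed or extra rule fails the whole policy, which is one reason set-level scores can sit well below per-rule accuracy.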

LEVEL 0

Understanding individual rules

Can a model decide whether a single explicit rule is actually supported by conversational evidence?

Decisive rules: Explicitly allowed or prohibited behaviors with sufficient evidence to judge.
Distractor rules: Topically related but not triggered; tests over-matching on surface similarity.
16,114 rules · 16.1 / policy
LEVEL 1

Resolving rule dependencies

Rules interact: one clause can override or activate another, making violations context-dependent.

Exception rules: A higher-priority rule that waives or reinterprets a base rule in a given context.
Conditional rules: Extra requirements that make an otherwise-compliant rule violated when conditions hold.
28,717 rules · 28.7 / policy
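A minimal sketch of how exception and conditional rules might change a base rule's status. This is an illustrative simplification, not the benchmark's annotation logic: it assumes an applicable exception always waives the base rule (ignoring "reinterprets"), that an unmet conditional requirement marks the base rule as violated, and that only base-rule IDs are reported. The class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass
class BaseRule:
    rule_id: str
    triggered: bool                  # conversation violates the rule on its face
    exception_applies: bool = False  # a higher-priority exception waives it here
    condition_unmet: bool = False    # a conditional rule's extra requirement is not met


def is_violated(rule: BaseRule) -> bool:
    # Assumption: an applicable exception takes priority over everything else.
    if rule.exception_applies:
        return False
    # Assumption: an unmet conditional requirement turns an otherwise
    # compliant base rule into a violated one.
    return rule.triggered or rule.condition_unmet


def violated_rules(rules: Iterable[BaseRule]) -> Set[str]:
    """Report base-rule IDs only; exception/conditional rules are modifiers."""
    return {r.rule_id for r in rules if is_violated(r)}


rules = [
    BaseRule("B1", triggered=True),
    BaseRule("B2", triggered=True, exception_applies=True),
    BaseRule("B3", triggered=False, condition_unmet=True),
    BaseRule("B4", triggered=False),
]
print(sorted(violated_rules(rules)))
```

In this toy example B2 is waived by its exception and B3 becomes violated through its conditional requirement, so the reported set is B1 and B3.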
LEVEL 2

Adapting to novel frameworks

The same rule structure, rewritten under a fully fictional policy framework with new concepts and terminology.

Fictional framework: Prevents solving by prior knowledge; models must infer the framework from context alone.
Preserved structure: Keeps L1's decisive, distractor, exception & conditional rules under new semantics.
32,982 rules · 33.0 / policy
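The per-policy averages quoted for each level can be checked with simple arithmetic. The sketch below assumes 1,000 policies per level (3,000 application policies split evenly over three levels), which the page implies but does not state outright.

```python
# Rule counts per level, as reported on this page.
rule_counts = {"L0": 16_114, "L1": 28_717, "L2": 32_982}

# Assumption: 3,000 application policies split evenly across three levels.
policies_per_level = 3_000 // 3

rules_per_policy = {
    level: round(count / policies_per_level, 1)
    for level, count in rule_counts.items()
}
for level, avg in rules_per_policy.items():
    print(f"{level}: {avg} rules per policy")

# Observation only: the headline total of 61,699 rules equals the L1 and L2
# counts added together. The page does not explain how L0 rules are counted.
l1_plus_l2 = rule_counts["L1"] + rule_counts["L2"]
print(l1_plus_l2)
```

The averages reproduce the 16.1, 28.7, and 33.0 figures above.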
Coverage & positioning

The benchmark at a glance

10 domains where policy boundaries vary

The domains are adapted from AILuminate and were chosen because their safety boundaries commonly differ across organizations, user groups, and regulatory contexts. There are 30 scenarios per domain (300 in total), instantiated into 1,000 conversations averaging 12.8 turns.

Academic Integrity
Content Moderation
Critical Infrastructure
Defamation
Discrimination
Fraud
Intellectual Property
Privacy
Sexual Content
Specialized Advice

The only safety benchmark to evaluate all three capabilities

SafePyramid is the only safety benchmark that formulates violated-rule set prediction as the evaluation task and separately evaluates individual rule understanding, rule dependency resolution, and novel context adaptation.

| Benchmark | Safety Domain | Multi-turn Interaction | Violated-rule Set Prediction | In-context Specification | Capability: Understanding Individual Rule | Capability: Resolving Rule Dependency | Capability: Adapting to Novel Context |
|---|---|---|---|---|---|---|---|
| SafetyBench | ✓ | ✕ | ✕ | ✕ | – | – | – |
| HarmBench | ✓ | ✕ | ✕ | ✕ | – | – | – |
| DynaBench | ✓ | ✓ | ✕ | ✓ | ✓ | ✕ | ✕ |
| IFEval | ✕ | ✕ | ✕ | ✓ | ✓ | ✕ | ✕ |
| LegalBench | ✕ | ✕ | ✕ | ✓ | ✓ | ✕ | ✕ |
| CL-Bench | ✕ | ✓ | ✕ | ✓ | ✕ | ✕ | ✓ |
| SafePyramid | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |

Quality assured by four frontier generators (Grok-4.1, GPT-5.4, Gemini-3.1-Pro, Claude-Opus-4.6), cross-model validation, and two rounds of expert review, with >90% human–LLM agreement.

Interactive results

Leaderboard

15 models: 10 frontier LLMs and 5 policy-configurable guard models.

What we learned

In-context policy guardrailing is far from solved

01 · DIFFICULTY HIERARCHY

Guardrails struggle most with novel policy frameworks

Across nearly all models, performance decreases as the benchmark moves from L0 to L2, and the difficulty is especially pronounced at L2. The gap between L0 and L2 is consistent across models, indicating that the benchmark hierarchy captures a shared difficulty pattern rather than an artifact of a single model family.

−41 pts: GPT-5.5 exact match, L0 → L2 (54.0% → 12.9%)
02 · COMPOSITION BOTTLENECK

Per-rule gains reveal a policy-composition bottleneck

Performance improves when the task is decomposed into per-rule classification, especially for guard models. Frontier LLMs are less bottlenecked by processing the full policy context, whereas smaller guard models benefit substantially when the relevant policy context is pre-organized. The main bottleneck is not rule understanding alone, but full-policy composition.
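The difference between evaluating a full policy and decomposing it into per-rule classification can be sketched as follows. The `Judge` signature and the keyword-based stand-in are hypothetical; a real setup would call a guard model or LLM, and the benchmark's actual per-rule protocol (including how related rules are grouped) is not described on this page.

```python
from typing import Callable, Dict, Set

# Hypothetical judge: takes a conversation and a {rule_id: rule_text} mapping
# and returns the IDs it considers violated. A real judge would be a model call.
Judge = Callable[[str, Dict[str, str]], Set[str]]


def per_policy(judge: Judge, conversation: str, rules: Dict[str, str]) -> Set[str]:
    """One call: the judge sees the whole policy at once."""
    return judge(conversation, rules)


def per_rule(judge: Judge, conversation: str, rules: Dict[str, str]) -> Set[str]:
    """One call per rule: the judge sees a single rule each time."""
    violated: Set[str] = set()
    for rule_id, text in rules.items():
        violated |= judge(conversation, {rule_id: text})
    return violated


def keyword_judge(conversation: str, rules: Dict[str, str]) -> Set[str]:
    """Toy stand-in only: flags a rule when its keyword occurs in the text."""
    lowered = conversation.lower()
    return {rule_id for rule_id, keyword in rules.items() if keyword in lowered}


toy_rules = {"R1": "password", "R2": "diagnosis"}  # hypothetical rules
toy_conversation = "Please share the admin password."
full = per_policy(keyword_judge, toy_conversation, toy_rules)
split = per_rule(keyword_judge, toy_conversation, toy_rules)
print(full, split)
```

The toy judge returns the same answer under both protocols because it has no context limits. With a real model, the per-rule loop shortens each prompt at the cost of more calls, which is the trade-off the finding above describes.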

03 · REASONING EFFORT

Higher reasoning effort helps more on complex policies

Comparing GPT-5.5 under low and xhigh reasoning effort, the gain is small on L0, where rules are independently classifiable, but much larger on L1 and L2. Extra reasoning budget mainly helps with complex rule dependencies and novel policy frameworks, rather than independent rule understanding.

04 · AGENTIC HARNESS

Agentic harnesses improve in-context policy guardrailing

Off-the-shelf agentic harnesses consistently improve per-policy evaluation without safety-specific tuning. The strongest result is achieved by Claude-Opus-4.7 with Claude Code.

60.4% avg RMR with Claude Code (+5.2 pts), RDR cut to 17.4%
05 · ERROR ATTRIBUTION

Dominant errors shift from decisive to exception to conditional rules

At L0, errors are dominated by decisive rules — models match local keywords in the conversation but fail to check the rule's actual scope, stance, or speaker attribution. At L1, the dominant error source shifts to exception rules, as models over-trigger exceptions. At L2, conditional-rule errors become more prominent: under novel frameworks, models incorrectly treat the conditional rules themselves as violated rules, instead of using them to decide whether the corresponding base rules are violated.
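One way to tally errors by rule type is sketched below. It assumes each rule ID carries a known type label and that an error is any rule predicted but not in the gold set (false positive) or in the gold set but not predicted (false negative). The page does not say how the authors attribute errors, so this is an illustration rather than their procedure. All names and IDs are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def attribute_errors(
    predicted: Iterable[str], gold: Iterable[str], rule_types: Dict[str, str]
) -> Counter:
    """Count errors keyed by (rule type, error kind)."""
    predicted_set, gold_set = set(predicted), set(gold)
    errors: Counter = Counter()
    for rule_id in predicted_set - gold_set:
        errors[(rule_types[rule_id], "false_positive")] += 1
    for rule_id in gold_set - predicted_set:
        errors[(rule_types[rule_id], "false_negative")] += 1
    return errors


# Hypothetical policy with one rule of each type.
rule_types = {
    "R1": "decisive",
    "R2": "distractor",
    "R3": "exception",
    "R4": "conditional",
}
gold = {"R1"}
predicted = {"R2", "R4"}  # missed R1, over-matched R2, flagged R4 itself
errors = attribute_errors(predicted, gold, rule_types)
for key, count in sorted(errors.items()):
    print(key, count)
```

In this toy case the conditional rule R4 is reported as violated in its own right, which mirrors the L2 failure pattern described above.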

Cite this work

BibTeX

SafePyramid · preprint 2026
@misc{zhang2026safepyramid,
      title={SafePyramid: A Hierarchical Benchmark for In-context Policy Guardrailing},
      author={Jiacheng Zhang and Haoyu He and Sen Zhang and Shen Wang and Xiaolei Xu and Yuhao Sun and Meng Shen and Feng Liu},
      year={2026},
      eprint={2606.29887},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2606.29887},
}