SafePyramid · A Hierarchical Benchmark for In-context Policy Guardrailing

The shift

From fixed taxonomies to policies in context

Conventional guardrails map an interaction to a coarse, predefined risk category. In-context policy guardrailing instead expands each risk category into explicit and auditable safety policies provided in context at inference time — making safety criteria transparent, inspectable, traceable, and accountable, while remaining adaptable to application-specific requirements.

Figure 1: Visual comparison between fixed-taxonomy guardrailing and in-context policy guardrailing. Both paradigms are derived from risk taxonomies, but fixed-taxonomy guardrails map user–model interactions to coarse risk categories, while in-context policy guardrails expand each risk category into explicit and auditable safety policies provided in context at inference time. Such policies make safety criteria transparent, inspectable, traceable, and accountable, while remaining adaptable to application-specific requirements.

Three capabilities, three levels

A pyramid of increasing difficulty

Each level keeps the core task — identify every violated rule — while adding one source of complexity. This disentangles where models fail: understanding a rule, resolving how rules interact, or learning an entirely new policy framework from context.

LEVEL 0

Understanding individual rules

Can a model decide whether a single explicit rule is actually supported by conversational evidence?

Decisive rules: Explicitly allowed or prohibited behaviors with sufficient evidence to judge.

Distractor rules: Topically related but not triggered, which tests over-matching on surface similarity.

16,114 rules (16.1 per policy)

LEVEL 1

Resolving rule dependencies

Rules interact: one clause can override or activate another, making violations context-dependent.

Exception rules: A higher-priority rule that waives or reinterprets a base rule in a given context.

Conditional rules: Extra requirements that make an otherwise-compliant rule violated when conditions hold.

28,717 rules (28.7 per policy)

LEVEL 2

Adapting to novel frameworks

The same rule structure, rewritten under a fully fictional policy framework with new concepts and terminology.

Fictional framework: Prevents solving by prior knowledge, so models must infer the framework from context alone.

Preserved structure: Keeps L1's decisive, distractor, exception, and conditional rules under new semantics.

32,982 rules (33.0 per policy)
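
The four rule types above can be pictured with a small sketch of how dependencies change the final violated-rule set. This is only an illustration under stated assumptions, not the benchmark's implementation: the names `Rule`, `RuleKind`, and `resolve_violations` are hypothetical, the per-rule judgments (`base_violated`, `triggered`) are assumed to be given, and applying exceptions after conditionals is an assumption based on exceptions being described as higher-priority.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Optional, Set


class RuleKind(Enum):
    DECISIVE = "decisive"
    DISTRACTOR = "distractor"
    EXCEPTION = "exception"
    CONDITIONAL = "conditional"


@dataclass(frozen=True)
class Rule:
    rule_id: str
    kind: RuleKind
    base_id: Optional[str] = None  # base rule that an exception/conditional rule modifies


def resolve_violations(
    rules: Iterable[Rule], base_violated: Set[str], triggered: Set[str]
) -> Set[str]:
    """Combine isolated rule judgments into a final violated-rule set.

    base_violated: decisive rule ids judged violated when read in isolation.
    triggered: exception/conditional rule ids whose context holds.
    """
    rules = list(rules)
    violated = set(base_violated)

    # A triggered conditional rule makes its base rule violated.
    for rule in rules:
        if rule.kind is RuleKind.CONDITIONAL and rule.rule_id in triggered and rule.base_id:
            violated.add(rule.base_id)

    # A triggered exception waives its base rule (assumed to take priority).
    for rule in rules:
        if rule.kind is RuleKind.EXCEPTION and rule.rule_id in triggered and rule.base_id:
            violated.discard(rule.base_id)

    # Distractor rules are never part of the answer.
    distractors = {r.rule_id for r in rules if r.kind is RuleKind.DISTRACTOR}
    return violated - distractors


policy = [
    Rule("R1", RuleKind.DECISIVE),
    Rule("R2", RuleKind.DECISIVE),
    Rule("R3", RuleKind.DISTRACTOR),
    Rule("E1", RuleKind.EXCEPTION, base_id="R1"),
    Rule("C1", RuleKind.CONDITIONAL, base_id="R2"),
]
print(resolve_violations(policy, base_violated={"R1"}, triggered={"E1", "C1"}))  # {'R2'}
```

In this toy policy, R1 looks violated in isolation but is waived by E1, while R2 looks compliant in isolation but becomes violated once C1's condition holds. Only base rule ids appear in the output; the exception and conditional rules act as modifiers rather than as violated rules themselves.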

Coverage & positioning

The benchmark at a glance

10 domains where policy boundaries vary

Adapted from AILuminate — chosen because their safety boundaries commonly differ across organizations, user groups, and regulatory contexts. 30 scenarios per domain, 300 in total, instantiated into 1,000 conversations averaging 12.8 turns.

Academic Integrity

Content Moderation

Critical Infrastructure

Defamation

Discrimination

Fraud

Intellectual Property

Privacy

Sexual Content

Specialized Advice

The only safety benchmark to evaluate all three capabilities

SafePyramid is the only safety benchmark that formulates violated-rule set prediction as the evaluation task and separately evaluates individual rule understanding, rule dependency resolution, and novel context adaptation.

Benchmark	Safety Domain	Multi-turn Interaction	Violated-rule Set Prediction	In-context Specification	Capability: Understanding Individual Rule	Capability: Resolving Rule Dependency	Capability: Adapting to Novel Context
SafetyBench	✓	✕	✕	✕	–	–	–
HarmBench	✓	✕	✕	✕	–	–	–
DynaBench	✓	✓	✕	✓	✓	✕	✕
IFEval	✕	✕	✕	✓	✓	✕	✕
LegalBench	✕	✕	✕	✓	✓	✕	✕
CL-Bench	✕	✓	✕	✓	✕	✕	✓
SafePyramid	✓	✓	✓	✓	✓	✓	✓

Quality is assured by generating data with four frontier models (Grok-4.1, GPT-5.4, Gemini-3.1-Pro, Claude-Opus-4.6), cross-model validation, and two rounds of expert review, with >90% human–LLM agreement.

Interactive results

Leaderboard

15 models: 10 frontier LLMs and 5 policy-configurable guard models. Results can be broken down by evaluation protocol, metric, difficulty level, and domain.

What we learned

In-context policy guardrailing is far from solved

01 · DIFFICULTY HIERARCHY

Guardrails struggle most with novel policy frameworks

Across nearly all models, performance decreases as the benchmark moves from L0 to L2, and the difficulty is especially pronounced at L2. The gap between L0 and L2 is consistent across models, indicating that the benchmark hierarchy captures a shared difficulty pattern rather than an artifact of a single model family.

−41 pts: GPT-5.5 exact match from L0 to L2 (54.0% → 12.9%)
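
To make the exact-match figure concrete, here is a minimal scoring sketch. It assumes exact match counts a conversation as correct only when the predicted violated-rule set equals the annotated set, so one missed or one spurious rule makes the whole prediction wrong. The function name `exact_match` and the rule ids are hypothetical, and the benchmark's actual scoring code may differ.

```python
from typing import Sequence, Set


def exact_match(predicted: Sequence[Set[str]], gold: Sequence[Set[str]]) -> float:
    """Fraction of conversations whose predicted violated-rule set equals the gold set."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("need at least one conversation to score")
    hits = sum(set(p) == set(g) for p, g in zip(predicted, gold))
    return hits / len(gold)


gold = [{"R1", "R4"}, {"R2", "R3"}, set()]
predicted = [{"R1", "R4"}, {"R2"}, set()]  # second conversation misses R3

score = exact_match(predicted, gold)  # 2 of 3 conversations match exactly
```

Because the metric is all-or-nothing per conversation, it becomes stricter as policies grow from 16.1 to 33.0 rules on average, which is one plausible reason scores drop sharply across levels.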

02 · COMPOSITION BOTTLENECK

Per-rule gains reveal a policy-composition bottleneck

Performance improves when the task is decomposed into per-rule classification, especially for guard models. Frontier LLMs are less bottlenecked by processing the full policy context, whereas smaller guard models benefit substantially when the relevant policy context is pre-organized. The main bottleneck is not rule understanding alone, but full-policy composition.
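
The difference between the two protocols can be sketched as follows. This is a simplified illustration, not the benchmark's harness: `Judge`, `per_policy`, `per_rule`, and `keyword_judge` are hypothetical names, the toy judge stands in for a model call, and how dependent rules are supplied in the per-rule setting is not specified here, so this sketch shows only the target rule in context.

```python
from typing import Callable, Sequence, Set, Tuple

RuleText = Tuple[str, str]  # (rule_id, rule text)
# Hypothetical judge: given a conversation and the rules shown in context,
# return the ids of the rules it considers violated.
Judge = Callable[[str, Sequence[RuleText]], Set[str]]


def per_policy(judge: Judge, conversation: str, policy: Sequence[RuleText]) -> Set[str]:
    """Single call: the judge sees the full policy and returns the violated set."""
    return judge(conversation, policy)


def per_rule(judge: Judge, conversation: str, policy: Sequence[RuleText]) -> Set[str]:
    """One call per rule: the judge sees one rule at a time; results are unioned."""
    violated: Set[str] = set()
    for rule in policy:
        violated |= judge(conversation, [rule])
    return violated


def keyword_judge(conversation: str, rules: Sequence[RuleText]) -> Set[str]:
    """Toy stand-in for a model: flags a rule when its text appears verbatim."""
    lowered = conversation.lower()
    return {rule_id for rule_id, text in rules if text.lower() in lowered}


policy = [("R1", "home address"), ("R2", "password")]
conversation = "Sure, here is her home address."

full = per_policy(keyword_judge, conversation, policy)
split = per_rule(keyword_judge, conversation, policy)
```

With a deterministic toy judge the two protocols agree. The finding above concerns real models, where composing many rules in a single context is the harder case.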

03 · REASONING EFFORT

Higher reasoning effort helps more on complex policies

Comparing GPT-5.5 under low and xhigh reasoning effort, the gain is small on L0, where rules are independently classifiable, but much larger on L1 and L2. Extra reasoning budget mainly helps with complex rule dependencies and novel policy frameworks, rather than independent rule understanding.

04 · AGENTIC HARNESS

Agentic harnesses improve in-context policy guardrailing

Off-the-shelf agentic harnesses consistently improve per-policy evaluation without safety-specific tuning. The strongest result is achieved by Claude-Opus-4.7 with Claude Code.

60.4%: average RMR with Claude Code (+5.2 pts); RDR cut to 17.4%

05 · ERROR ATTRIBUTION

Dominant errors shift from decisive to exception to conditional rules

At L0, errors are dominated by decisive rules — models match local keywords in the conversation but fail to check the rule's actual scope, stance, or speaker attribution. At L1, the dominant error source shifts to exception rules, as models over-trigger exceptions. At L2, conditional-rule errors become more prominent: under novel frameworks, models incorrectly treat the conditional rules themselves as violated rules, instead of using them to decide whether the corresponding base rules are violated.

Cite this work

BibTeX

SafePyramid · preprint 2026

@misc{zhang2026safepyramid,
      title={SafePyramid: A Hierarchical Benchmark for In-context Policy Guardrailing},
      author={Jiacheng Zhang and Haoyu He and Sen Zhang and Shen Wang and Xiaolei Xu and Yuhao Sun and Meng Shen and Feng Liu},
      year={2026},
      eprint={2606.29887},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2606.29887},
}