# Safety

Source: https://ibl.ai/safety

Safety isn't a feature — it's the product.

## The Problem: Single-Layer Safety Fails

Traditional approaches create exploitable gaps that users learn to work around.

### Input filtering only

Blocks obvious requests but misses reframed prompts — academic, hypothetical, or therapeutic framing bypasses the filter.

### Vendor promises

Relying on the LLM provider's built-in guardrails offers no institutional control and no visibility into what was caught or missed.

### Policy statements

Written policies without enforcement infrastructure are not protection — they are documentation of intent.

The failure cycle: boundary discovery → request reframing → system compliance. A single checkpoint is not enough. ibl.ai replaces single-checkpoint safety with enforcement infrastructure.

## Dual-Layer Moderation: Two Independent Safety Checkpoints

Every interaction passes through both layers — input and output are evaluated independently:

User sends message → Layer 1 — Input Moderation → LLM generates response → Layer 2 — Output Safety

### Layer 1 — Input Moderation

- Evaluates user messages before the LLM processes them
- Flags direct harmful requests and evasion attempts
- Detects academic, hypothetical, and therapeutic reframing
- Blocks problematic prompts before they reach the model

### Layer 2 — Output Safety

- Evaluates model responses before delivery to the user
- Catches manipulative, authority-framed, or harmful content
- Independent verification — does not rely on the LLM's own judgment
- Blocks unsafe responses even if input moderation passed

Harmful content is blocked — not rephrased into a "safer" version. The interaction is stopped and flagged for administrative review.
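The two-checkpoint flow above can be sketched as a simple pipeline. This is a minimal illustration, not ibl.ai's actual implementation: the classifier logic, function names, and `admin_queue` structure are all assumptions standing in for real moderation models and governance infrastructure.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Verdict:
    allowed: bool
    reason: str = ""

# Hypothetical stand-in for a real input-moderation classifier.
def moderate_input(prompt: str) -> Verdict:
    # Layer 1: evaluate the user message before the LLM sees it,
    # including reframed ("hypothetical") variants of a harmful request.
    banned = ("how to build a weapon", "how would one build a weapon")
    if any(phrase in prompt.lower() for phrase in banned):
        return Verdict(False, "harmful or reframed harmful request")
    return Verdict(True)

# Hypothetical stand-in for a real output-safety classifier.
def moderate_output(response: str) -> Verdict:
    # Layer 2: independently evaluate the model's response before delivery.
    if "step 1: acquire" in response.lower():
        return Verdict(False, "unsafe instructions in response")
    return Verdict(True)

admin_queue: list[dict] = []  # flagged interactions, preserved with full context

def handle(user_id: str, prompt: str, llm: Callable[[str], str]) -> str:
    v1 = moderate_input(prompt)
    if not v1.allowed:
        # Blocked, not rephrased — and flagged for administrative review.
        admin_queue.append({"user": user_id, "prompt": prompt,
                            "layer": 1, "reason": v1.reason})
        return "This request was blocked."
    response = llm(prompt)
    v2 = moderate_output(response)
    if not v2.allowed:
        # Blocked even though input moderation passed — independent checkpoint.
        admin_queue.append({"user": user_id, "prompt": prompt,
                            "layer": 2, "reason": v2.reason})
        return "This response was blocked."
    return response
```

Note that a blocked interaction short-circuits the pipeline: nothing is softened into a "safer" answer, and the full prompt plus the user identity lands in the queue for the governance workflow described below.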
Flagged interactions flow to institutional governance.

## Governance: Visibility & Institutional Control

Flagged prompts are not silently discarded — they land in admin queues with full context.

### What was requested

Full prompt content preserved.

### Who submitted it

User identity and context.

### Frequency & patterns

Repeat behavior detection.

### Human intervention

Escalation triggers and workflows.

### Administrative Workflow

Identify → Document → Escalate → Support

## Policy Control: Customizable Safety Policies

Different audiences have different needs — your institution controls the moderation logic.

### What You Control

- Moderation logic and rules
- Sensitivity thresholds per category
- Category focus — which topics to monitor
- Audience-specific policies (minors vs. adults)
- Context-specific rules (counseling vs. coursework)

### Why It Matters

- A K-12 environment requires different guardrails than a corporate setting
- Counseling-adjacent interactions need different handling than coursework
- Regulatory requirements vary by industry and jurisdiction
- One-size-fits-all moderation leaves gaps everywhere

## Why This Matters: The Incentive Problem

Commercial AI systems optimize for engagement — institutions need the opposite.

### Commercial AI Optimizes For

- Engagement and session length
- Reduced friction — fewer blocks, more compliance
- Broad applicability across consumer use cases
- Speed over caution

### Institutions Require

- Protection of users — especially vulnerable populations
- Liability reduction with auditable enforcement
- Governance and institutional oversight
- Safety as a non-negotiable, not a tradeoff

## Validation: Stress-Tested Against Real Threats

Internal testing covers both direct and evasion-based attack vectors.

### Direct attack scenarios

Explicit requests for harmful content — weapons, explosives, self-harm — are blocked at the input layer.

### Evasion-based scenarios

Academically framed, hypothetical, and therapeutic reframing attempts are detected and blocked.

### Administrative visibility

Every blocked interaction is logged with full context and made visible to administrators.

### Latency tradeoff

Output evaluation adds processing time — acceptable in contexts where safety matters more than speed.

## Summary: Enforcement Infrastructure — Not Promises

What the ibl.ai safety system delivers:

1. **Dual-layer moderation** — Independent input and output evaluation — two checkpoints, not one.
2. **Flagged visibility** — Every blocked interaction is preserved with full context for administrative review.
3. **Human workflows** — Identify, document, escalate, support — institutional processes connected to the safety system.
4. **Institutional policy control** — Your organization defines moderation logic, sensitivity thresholds, and audience-specific rules.
5. **Blocking, not rephrasing** — Harmful content is stopped — not softened into a compliant-sounding version.

---

*[View on ibl.ai](https://ibl.ai/safety)*