Safety
Safety isn't a feature; it's the product
Single-Layer Safety Fails
Traditional approaches create exploitable gaps that users learn to work around
Input filtering only
Blocks obvious requests but misses reframed prompts: academic, hypothetical, or therapeutic framing bypasses the filter
Vendor promises
Relying on the LLM provider's built-in guardrails offers no institutional control and no visibility into what was caught or missed
Policy statements
Written policies without enforcement infrastructure are not protection; they are documentation of intent
Two Independent Safety Checkpoints
Every interaction passes through both layers: input and output are evaluated independently
Layer 1: Input Moderation
- Evaluates user messages before the LLM processes them
- Flags direct harmful requests and evasion attempts
- Detects academic, hypothetical, and therapeutic reframing
- Blocks problematic prompts before they reach the model
Layer 2: Output Safety
- Evaluates model responses before delivery to the user
- Catches manipulative, authority-framed, or harmful content
- Independent verification: does not rely on the LLM's own judgment
- Blocks unsafe responses even if input moderation passed
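A minimal sketch of how two independent checkpoints can wrap a model call is shown below. The function names (check_input, check_output, handle_message) and the injected call_llm and flag callbacks are illustrative assumptions, not the actual ibl.ai interface.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class ModerationResult:
    allowed: bool
    category: Optional[str] = None   # e.g. "weapons", "self-harm", "evasion"
    reason: Optional[str] = None

def check_input(prompt: str) -> ModerationResult:
    """Layer 1: evaluate the user message before the LLM processes it.
    A real classifier would score direct requests and reframed variants;
    this stub allows everything."""
    return ModerationResult(allowed=True)

def check_output(response: str) -> ModerationResult:
    """Layer 2: evaluate the model response independently of Layer 1."""
    return ModerationResult(allowed=True)

def handle_message(prompt: str,
                   call_llm: Callable[[str], str],
                   flag: Callable[[str, str, ModerationResult], None]) -> str:
    # Checkpoint 1: input moderation, before any model call.
    verdict = check_input(prompt)
    if not verdict.allowed:
        flag("input", prompt, verdict)   # preserved for the admin queue
        return "This request was blocked by institutional policy."

    response = call_llm(prompt)

    # Checkpoint 2: output safety, even when the input passed.
    verdict = check_output(response)
    if not verdict.allowed:
        flag("output", prompt, verdict)
        return "This response was withheld for review."

    return response
```

Because the output check never consults the model's own judgment, a response can still be blocked even when the prompt cleared Layer 1.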
Visibility & Institutional Control
Flagged prompts are not silently discarded; they land in admin queues with full context
What was requested
Full prompt content preserved
Who submitted it
User identity and context
Frequency & patterns
Repeat behavior detection
Human intervention
Escalation triggers and workflows
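A sketch of what a flagged-queue entry might preserve, assuming one record per event; the field names and the escalation threshold below are illustrative, not the platform's schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List

@dataclass
class FlaggedEvent:
    """One admin-queue entry; field names are illustrative assumptions."""
    user_id: str       # who submitted it
    prompt: str        # what was requested, preserved verbatim
    layer: str         # "input" or "output"
    category: str      # policy category that triggered the flag
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

def needs_escalation(queue: List[FlaggedEvent], user_id: str, threshold: int = 3) -> bool:
    """Repeat-behavior detection: trigger a human workflow once a user
    accumulates `threshold` flags. The threshold value is an assumption."""
    return sum(1 for e in queue if e.user_id == user_id) >= threshold
```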
Administrative Workflow
Customizable Safety Policies
Different audiences have different needs; your institution controls the moderation logic (see the configuration sketch after the lists below)
What You Control
- Moderation logic and rules
- Sensitivity thresholds per category
- Category focus: which topics to monitor
- Audience-specific policies (minors vs. adults)
- Context-specific rules (counseling vs. coursework)
Why It Matters
- A K-12 environment requires different guardrails than a corporate setting
- Counseling-adjacent interactions need different handling than coursework
- Regulatory requirements vary by industry and jurisdiction
- One-size-fits-all moderation leaves gaps everywhere
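One way these controls could be expressed is a per-institution policy object. The sketch below uses assumed keys, categories, and values to illustrate thresholds, category focus, and audience- and context-specific rules; it is not a documented ibl.ai configuration format.

```python
# Hypothetical policy configuration; keys, categories, and values are
# assumptions used to illustrate the controls listed above.
SAFETY_POLICY = {
    "categories": ["weapons", "self-harm", "harassment", "evasion"],
    "thresholds": {              # per-category sensitivity, 0.0 lenient .. 1.0 strict
        "weapons": 0.90,
        "self-harm": 0.95,
        "harassment": 0.80,
        "evasion": 0.85,
    },
    "audiences": {
        "minors": {"threshold_overrides": {"self-harm": 0.99}, "always_escalate": True},
        "adults": {"always_escalate": False},
    },
    "contexts": {
        "counseling": {"route_flags_to": "support_staff"},  # handled differently than coursework
        "coursework": {"route_flags_to": "admin_queue"},
    },
}
```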
The Incentive Problem
Commercial AI systems optimize for engagement; institutions need the opposite
Commercial AI Optimizes For
- Engagement and session length
- Reduced friction: fewer blocks, more compliance
- Broad applicability across consumer use cases
- Speed over caution
Institutions Require
- Protection of users, especially vulnerable populations
- Liability reduction with auditable enforcement
- Governance and institutional oversight
- Safety as a non-negotiable, not a tradeoff
Stress-Tested Against Real Threats
Internal testing covers both direct and evasion-based attack vectors
Direct attack scenarios
Explicit requests for harmful content (weapons, explosives, self-harm) are blocked at the input layer
Evasion-based scenarios
Academic, hypothetical, and therapeutic reframing attempts are detected and blocked
Administrative visibility
Every blocked interaction is logged with full context and made visible to administrators
Latency tradeoff
Output evaluation adds processing time, which is acceptable in contexts where safety matters more than speed
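As a rough illustration of what such testing can look like, direct and reframed prompts can be replayed against the input checkpoint. The scenario placeholders and helper below are assumptions, not the internal test suite.

```python
# Illustrative attack-scenario replay; `moderate` stands in for whatever
# classifier backs the input layer (e.g. the check_input sketch above).
DIRECT_ATTACKS = [
    "<explicit request for weapon or explosive instructions>",
    "<explicit request for self-harm methods>",
]
EVASION_ATTEMPTS = [
    "<the same request framed as an academic question>",
    "<the same request framed as hypothetical or therapeutic>",
]

def run_attack_suite(moderate) -> list:
    """Return the prompts that slipped through; an empty list means every
    scenario was blocked."""
    return [p for p in DIRECT_ATTACKS + EVASION_ATTEMPTS if moderate(p).allowed]
```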
Enforcement Infrastructure, Not Promises
What the ibl.ai safety system delivers
Dual-layer moderation
Independent input and output evaluation: two checkpoints, not one.
Flagged visibility
Every blocked interaction is preserved with full context for administrative review.
Human workflows
Identify, document, escalate, support: institutional processes connected to the safety system.
Institutional policy control
Your organization defines moderation logic, sensitivity thresholds, and audience-specific rules.
Blocking, not rephrasing
Harmful content is stopped, not softened into a compliant-sounding version.