Refusal Detection

ML-powered detection of LLM refusals with automatic handling options.

How It Works

Tollbooth uses a zero-shot classification model (bundled in the Docker image) to analyze LLM responses for refusal patterns. Both text content and thinking blocks are analyzed.

No external API calls are made for detection—the model runs locally.

Creating LLM Rules

LLM Rules are configured in the Rules View under the LLM Rules tab.

Basic Settings

Field	Description
Name	Descriptive name
Enabled	Toggle on/off

Detection Settings

Setting	Description
Confidence Threshold	Minimum confidence to trigger (0-1, default 0.7)
Tokens to Analyze	Number of tokens to check (0 = all)

Lower thresholds catch more refusals but may have false positives.

Actions

Action	Description
Passthrough	Log the refusal, forward unchanged
Prompt User	Hold in queue for manual review
Modify	Auto-generate replacement response

Fallback Configuration (Modify Action)

When using the Modify action, configure how replacements are generated:

Field	Description
Provider	LLM provider for generating replacements
Custom Prompt	Template for generating alternatives
System Prompt	Instructions for the replacement LLM

Filters (Optional)

Narrow which traffic the rule applies to:

Filter	Description
Host	Match specific API hosts
Path	Match specific API paths
Model	Match specific model names
Provider	anthropic, openai, google

Pending Refusals Queue

When a rule's action is Prompt User, detected refusals appear in a separate queue.

Queue Item Details

Original response content
Confidence score
Tokens analyzed
Which rule triggered detection

Actions

Action	Description
Approve	Forward original response unchanged
Generate Alternative	Use LLM to create replacement
Forward Modified	Send custom or generated response

Timeout

Auto-Forward

Pending refusals auto-forward after 5 minutes to prevent hangs.

Visual Indicators

Traffic and conversation views show badges:

Badge	Meaning
Orange Refusal	Refusal detected, not modified
Purple Modified	Refusal detected and replaced

Example Use Cases

Log All Refusals

Create a rule with:

Action: Passthrough
Confidence: 0.7
No filters (apply to all)

Refusals are logged but traffic flows normally.

Review Before Forwarding

Create a rule with:

Action: Prompt User
Confidence: 0.8 (higher to reduce false positives)
Filter: Provider = anthropic

Anthropic refusals are held for your review.

Auto-Replace Refusals

Create a rule with:

Action: Modify
Confidence: 0.9 (high to avoid replacing valid responses)
Fallback provider configured
Custom prompt: "Provide a helpful response that addresses the user's request"

Detected refusals are automatically replaced.

Troubleshooting

Detection Not Working

Check that LLM rules are enabled
Lower the confidence threshold
Check backend logs: docker compose logs -f backend

High False Positives

Raise the confidence threshold
Add filters to narrow scope
Use Prompt User action to review manually

Slow Detection

The ML model loads on first use. Subsequent detections are fast.