Agent Harness
The post-flight quality system wrapping every agent run — verifier, contract, swarm, plan selection, retries, and mid-stream alerts.
Overview
Every agent run in Praxiom AI is wrapped by the Agent Harness — a thin coordinator that runs pre-flight, mid-flight, and post-flight quality checks around _generate_sse_stream. The harness does not own the stream or replace the primary agent. It surrounds it.
The harness exists to:
- Catch hallucinations and refusals early via heuristic mid-stream alerts and an independent Haiku verifier with fresh context.
- Enforce per-workflow contracts (e.g. "synthesis must produce N cited insights") so output shape is checked algorithmically, not just by an LLM.
- Give users a quality signal — contract pass/fail, verifier score, workspace maturity, RQS/RecQS/DocQS — instead of an opaque blob of text.
- Self-heal — retry with a surgical correction prompt when contract or verifier fails, up to a per-workflow attempt cap.
- Calibrate over time — each workspace builds a profile from telemetry + feedback, which in turn tunes contract thresholds and verifier scoring.
The harness is always on — there is no per-request toggle. Overhead is typically ~300-500ms (one Haiku classification + one Haiku verifier call, ~$0.002 total).
Lifecycle
┌──────────────── PRE-FLIGHT (parallel with intent inference) ───────────────┐
│ ComplexityClassifier → SIMPLE | MEDIUM | COMPLEX + ExpectedOutput │
│ PlanGenerator (COMPLEX) → 2-3 candidate plans │
│ PlanSelector → best plan (deterministic scoring) │
│ get_quality_preamble → injects quality requirements into system prompt │
└────────────────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────── STREAM (_generate_sse_stream) ─────────────────────────────┐
│ Primary agent runs — tokens, tool_use, tool_result, creation events │
│ StreamMonitor checks buffer at 500 / 2000 / 5000 chars │
│ └── harness_stream_alert (warning | critical) │
│ Optional Swarm mode for high-source-count synthesis │
│ └── harness_swarm_activated / harness_swarm_complete │
└────────────────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────── POST-FLIGHT (retry loop) ──────────────────────────────────┐
│ ContractEvaluator → pass/fail, score, issues │
│ VerifierAgent → independent Haiku judge (fresh context) │
│ └── verify_contrastive for COMPLEX synthesis (accuracy+completeness+use) │
│ RetryPolicy → decide whether to retry with a healing prompt │
│ └── harness_retry if should_retry is True (up to max_attempts) │
│ harness_contract / harness_verifier / harness_rqs / harness_rec_qs / ... │
│ WorkspaceProfile → adjust thresholds, emit harness_workspace_profile │
│ Background: telemetry emission + entropy scan (fresh DB session) │
└────────────────────────────────────────────────────────────────────────────┘
Source: app/core/harness/orchestrator.py, app/api/routes/stream.py.
Plan Selection
For queries classified as COMPLEX, the harness generates 2-3 candidate execution plans using Haiku, then selects one using a deterministic scorer (no second LLM call for selection). The selected plan's prompt_modifier and subtasks feed into the primary agent run.
Scoring dimensions
Each candidate ExecutionPlan is scored on four dimensions:
| Dimension | Weight | What it measures |
|---|---|---|
| coverage | 0.35 | Overlap between query keywords and plan description + prompt modifier |
| feasibility | 0.25 | Whether workspace has the data the plan needs (sources for synthesis, insights for recommendations) |
| efficiency | 0.20 | Fewer estimated turns = higher score (clamped to >= 1) |
| alignment | 0.20 | Does suggested_workflow match the intent classifier's detection? |
Composite total = 0.35*coverage + 0.25*feasibility + 0.20*efficiency + 0.20*alignment.
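As a rough illustration of the weighted sum above, the sketch below scores two made-up candidates and picks the higher composite. The `PlanScores` container, the candidate names, and the per-dimension values are hypothetical; only the four weights come from the table. How the real scorer derives each dimension, or breaks ties, is not shown here.

```python
from dataclasses import dataclass

# Weights taken from the scoring table.
WEIGHTS = {"coverage": 0.35, "feasibility": 0.25, "efficiency": 0.20, "alignment": 0.20}


@dataclass
class PlanScores:
    """Hypothetical container for one candidate's per-dimension scores (each 0-1)."""
    name: str
    coverage: float
    feasibility: float
    efficiency: float
    alignment: float


def composite(scores: PlanScores) -> float:
    """Weighted sum of the four dimensions."""
    return sum(weight * getattr(scores, dim) for dim, weight in WEIGHTS.items())


def select_best(candidates: list[PlanScores]) -> PlanScores:
    """Pick the highest composite; max() keeps the first candidate on ties (an assumption)."""
    return max(candidates, key=composite)


candidates = [
    PlanScores("broad-survey", coverage=0.8, feasibility=0.5, efficiency=0.5, alignment=1.0),
    PlanScores("focused-pass", coverage=0.6, feasibility=1.0, efficiency=1.0, alignment=1.0),
]
best = select_best(candidates)
print(best.name, round(composite(best), 3))
```

Because selection is a pure function of the scores, the same candidates always yield the same plan, which is the point of keeping the selector deterministic.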
When it runs
- Only for COMPLEX queries (ComplexityResult.is_complex == True).
- Runs after complexity classification, before the primary agent starts.
- Never raises; falls back to a single default "Execute {workflow_type} workflow directly" plan on any error.
Source: app/core/harness/planning.py.
Contract Evaluation
WorkflowContract defines what "done" means per workflow type. ContractEvaluator checks actual agent output against the contract and returns a ContractResult with passed, score (fraction of checks passed), and issues.
Base contracts
| Workflow | min_insights | min_recommendations | min_documents | require_citations | require_artifacts | min_output_chars |
|---|---|---|---|---|---|---|
| synthesis | max(1, min(source_count, 10)) | 0 | 0 | yes | yes | 100 |
| recommendation | 0 | 1 | 0 | no | yes | 100 |
| drafting | 0 | 0 | 1 | no | yes | 200 |
| full_pipeline | max(1, min(source_count, 10)) | 1 | 1 | yes | yes | 200 |
| chat | 0 | 0 | 0 | no | no | 50 |
Adaptive overrides
Two layers can tighten or loosen the base contract:
- ExpectedOutput from the complexity classifier: per-query overrides (e.g. classifier says "this query needs 5 insights with citations"). Applied in WorkflowContract.for_workflow(..., expected=...).
- WorkspaceProfile.quality_multiplier: mature workspaces with high historical quality get thresholds scaled by 1.15; new workspaces get 0.85.
Example issues
"Output too short: 42 chars (minimum 100)"
"Insufficient insights: 1 created (minimum 3)"
"Citations required but none found — insights must include direct quotes from sources"
"No artifacts created — at least one insight, recommendation, or document is required"
Source: app/core/harness/contracts.py.
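To make the contract check concrete, here is a minimal sketch covering two of the checks for the synthesis row. The `Contract` dataclass, the `evaluate` function, and the rule that `passed` means "no issues" are assumptions made for illustration; the real WorkflowContract has more fields (citations, artifacts, recommendations, documents) that are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Contract:
    """Hypothetical subset of WorkflowContract fields used in this sketch."""
    min_insights: int
    min_output_chars: int


@dataclass
class ContractResult:
    passed: bool
    score: float  # fraction of checks that passed
    issues: list[str] = field(default_factory=list)


def synthesis_contract(source_count: int) -> Contract:
    # Base synthesis row: min_insights = max(1, min(source_count, 10)), 100 chars.
    return Contract(min_insights=max(1, min(source_count, 10)), min_output_chars=100)


def evaluate(contract: Contract, output_text: str, insights_created: int) -> ContractResult:
    # Each check pairs a boolean outcome with the issue text reported on failure.
    checks = [
        (len(output_text) >= contract.min_output_chars,
         f"Output too short: {len(output_text)} chars (minimum {contract.min_output_chars})"),
        (insights_created >= contract.min_insights,
         f"Insufficient insights: {insights_created} created (minimum {contract.min_insights})"),
    ]
    issues = [message for ok, message in checks if not ok]
    score = sum(ok for ok, _ in checks) / len(checks)
    # ASSUMPTION: the contract passes only when every check passes.
    return ContractResult(passed=not issues, score=score, issues=issues)


result = evaluate(synthesis_contract(source_count=3), "x" * 42, insights_created=1)
for issue in result.issues:
    print(issue)
```

With three sources, a 42-character output, and one insight, both checks fail, so the sketch reports the first two example issues listed above.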
Verifier
VerifierAgent is an independent Haiku judge spawned with fresh context after the primary agent completes. It receives:
- The original user query (first 500 chars)
- The agent's text output (first 8000 chars)
- A summary of artifacts created ("3 insights, 1 document")
It returns JSON with passed, score (0-1), issues, feedback.
Key design: the verifier has no access to workspace data — it judges purely on text quality. This prevents contamination by the same context that may have led the primary agent astray.
- Model: claude-haiku-4-5-20251001
- Cost: ~$0.001 per verification
- Latency: ~200ms
- Fallback: on API error, returns passed=False, score=0.0, feedback="[is_fallback] ..." so retry gates still trigger.
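The input truncation and the fallback result can be sketched as follows. This is not the actual VerifierAgent code: the message layout, the function names, and the choice to apply the same fallback to an unparseable reply (the source only mentions API errors) are all assumptions.

```python
import json
from dataclasses import dataclass, field

QUERY_LIMIT = 500    # first 500 chars of the user query
OUTPUT_LIMIT = 8000  # first 8000 chars of the agent output


@dataclass
class VerifierResult:
    passed: bool
    score: float
    issues: list[str] = field(default_factory=list)
    feedback: str = ""


def build_verifier_input(query: str, output: str, artifact_summary: str) -> str:
    """Assemble the three inputs; the layout of this message is an assumption."""
    return (
        f"Query: {query[:QUERY_LIMIT]}\n"
        f"Output: {output[:OUTPUT_LIMIT]}\n"
        f"Artifacts: {artifact_summary}"
    )


def parse_verifier_reply(raw: str) -> VerifierResult:
    """Parse the JSON reply, returning a failing fallback result on bad input."""
    try:
        data = json.loads(raw)
        return VerifierResult(
            passed=bool(data["passed"]),
            score=float(data["score"]),
            issues=list(data.get("issues", [])),
            feedback=str(data.get("feedback", "")),
        )
    except (ValueError, KeyError, TypeError) as exc:
        # Failing closed keeps the retry gates active, as described above.
        return VerifierResult(passed=False, score=0.0, feedback=f"[is_fallback] {exc}")


good = parse_verifier_reply('{"passed": true, "score": 0.8, "issues": [], "feedback": "ok"}')
bad = parse_verifier_reply("not json")
print(good.score, bad.feedback.startswith("[is_fallback]"))
```

Returning `passed=False` with a tagged feedback string lets downstream code distinguish "the verifier judged this poor" from "verification did not run".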
Calibration from feedback
Every 5 minutes, the verifier rebuilds a per-workspace calibration context from up to 3 recent feedback mismatches — cases where the verifier score disagreed with user thumbs up/down. These are injected into the system prompt as "adjust scoring for similar patterns" examples, so the verifier learns each workspace's quality bar.
Contrastive verification (COMPLEX synthesis only)
For COMPLEX queries in synthesis, full_pipeline, or recommendation workflows, verify_contrastive runs three focused verifiers in parallel:
- Accuracy — are claims supported by evidence?
- Completeness — is the query fully answered?
- Usefulness — is the output actionable?
Final score = minimum across all three perspectives. All issues are tagged with their perspective, e.g. "[accuracy] Claim about market size has no citation".
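The aggregation step can be sketched in a few lines. The `PerspectiveResult` type and `combine_contrastive` function are hypothetical names, and the three verifier calls themselves (which run in parallel) are omitted; the sketch only shows the minimum-score rule and the perspective tagging.

```python
from dataclasses import dataclass, field


@dataclass
class PerspectiveResult:
    perspective: str  # "accuracy", "completeness", or "usefulness"
    score: float      # 0-1
    issues: list[str] = field(default_factory=list)


def combine_contrastive(results: list[PerspectiveResult]) -> tuple[float, list[str]]:
    """Final score is the minimum; each issue is prefixed with its perspective."""
    final_score = min(r.score for r in results)
    tagged = [f"[{r.perspective}] {issue}" for r in results for issue in r.issues]
    return final_score, tagged


results = [
    PerspectiveResult("accuracy", 0.55, ["Claim about market size has no citation"]),
    PerspectiveResult("completeness", 0.90),
    PerspectiveResult("usefulness", 0.80),
]
final_score, tagged_issues = combine_contrastive(results)
print(final_score, tagged_issues)
```

Taking the minimum means one weak perspective is enough to pull the whole run below the retry threshold, even when the other two score well.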
Source: app/core/harness/verifier.py.
Swarm Mode
For synthesis workflows with many sources, the harness can activate a specialist swarm — multiple sub-agents running in parallel, each analyzing sources from a different angle, with an aggregator agent merging their outputs.
Activation criteria
- User's plan has the has_synthesis_swarm entitlement.
- Source count typically >= 5 (treated as COMPLEX).
- should_activate() returns True for the workflow type.
Events
- harness_swarm_activated: emitted on start with specialist count and names.
- tool_sandbox (tool_name synthesis_swarm): progress updates from each specialist.
- harness_swarm_complete: emitted on finish with per-specialist success counts, total duration, and estimated cost.
After the swarm finishes, the aggregator output becomes the prompt for the main agent pass, which persists the final insights via save_insights.
Source: app/api/routes/stream.py (search for swarm_active).
Mid-Stream Alerts
StreamMonitor runs lightweight regex-based checks on the accumulated output buffer at three character thresholds. No LLM calls. Designed to catch obvious failures early.
Checkpoints
| Checkpoint | Check | Triggers |
|---|---|---|
| 500 chars | Refusal / filler detection | critical if text matches "I can't", "I cannot", "I don't have access", "Unfortunately", etc. warning for synthesis/recommendation/full_pipeline if it opens with "Sure,", "Of course,", "That's a great question". |
| 2000 chars | Structural formatting | warning for synthesis/recommendation/full_pipeline/drafting if no headers (#), bullets (- or *), or numbered items are present. |
| 5000 chars | Citations / evidence | warning for synthesis/full_pipeline if no citation markers ("according to", "[1]", quoted text, etc.) are detected. |
Each threshold fires at most once per run. All alerts are emitted as harness_stream_alert SSE events with severity, checkpoint, issue, and suggestion.
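A simplified monitor along these lines might look like the sketch below. It covers only the 500- and 2000-char checkpoints, and the class name, the regex patterns, and the issue strings are illustrative assumptions, not the patterns used in stream_monitor.py.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Illustrative patterns only; the real refusal list is longer.
REFUSAL = re.compile(r"I can't|I cannot|I don't have access|Unfortunately")
STRUCTURE = re.compile(r"^\s*(#|[-*]\s|\d+\.\s)", re.MULTILINE)
STRUCTURED_WORKFLOWS = {"synthesis", "recommendation", "full_pipeline", "drafting"}


@dataclass
class Alert:
    severity: str
    checkpoint: int
    issue: str


class StreamMonitorSketch:
    """Hypothetical monitor: accumulates chunks, checks each threshold once."""

    def __init__(self, workflow: str) -> None:
        self.workflow = workflow
        self.buffer = ""
        self.fired: set[int] = set()

    def feed(self, chunk: str) -> list[Alert]:
        self.buffer += chunk
        alerts: list[Alert] = []
        for checkpoint in (500, 2000):
            if len(self.buffer) < checkpoint or checkpoint in self.fired:
                continue
            self.fired.add(checkpoint)  # each threshold is evaluated at most once
            alert = self._check(checkpoint)
            if alert is not None:
                alerts.append(alert)
        return alerts

    def _check(self, checkpoint: int) -> Optional[Alert]:
        head = self.buffer[:checkpoint]
        if checkpoint == 500 and REFUSAL.search(head):
            return Alert("critical", 500, "Possible refusal detected")
        if (checkpoint == 2000 and self.workflow in STRUCTURED_WORKFLOWS
                and not STRUCTURE.search(head)):
            return Alert("warning", 2000, "No headers, bullets, or numbered items")
        return None


monitor = StreamMonitorSketch("synthesis")
first = monitor.feed("I cannot help with that. " + "x" * 500)
print([(a.severity, a.checkpoint) for a in first])
```

Because the checks are plain regex scans over a bounded prefix, they add negligible latency compared with any LLM-based check.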
Source: app/core/harness/stream_monitor.py.
Retry & Self-Healing
When an attempt fails the contract or scores too low on the verifier, RetryPolicy decides whether to retry with a surgical healing prompt built by FailureAnalyzer.
Policy per workflow
| Workflow | max_attempts | good_enough_score | low_quality_threshold |
|---|---|---|---|
| synthesis, full_pipeline | settings.retry_max_attempts_synthesis (default 3) | 0.70 | 0.50 |
| chat | settings.retry_max_attempts_chat (default 2) | 0.60 | 0.40 |
| recommendation, drafting, other | settings.retry_max_attempts_default (default 3) | 0.65 | 0.50 |
Decision logic (should_retry)
combined = ((1.0 if contract_passed else 0.0) + verifier_score) / 2.0
if attempt >= max_attempts: return False, "Max attempts reached"
if combined >= good_enough_score: return False, "Quality sufficient"
if gap < 0.10 and contract_passed: return False, "Marginal gap — retry unlikely to help"
if gap > 0.40 and attempt >= 2: return False, "Severe gap persists — source material may be insufficient"
if is_error: return True, "Hard error — retrying"
if not contract_passed: return True, "Contract failed — retrying with healing prompt"
if verifier_score < low_quality_threshold: return True, "Low quality — retrying"
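The decision logic above can be turned into a runnable sketch under two stated assumptions. First, `gap` is never defined in the listing; here it is taken to be the shortfall between `good_enough_score` and `combined`. Second, the listing has no final branch, so the sketch defaults to not retrying. The class name is hypothetical and the reason strings are abbreviated.

```python
from dataclasses import dataclass


@dataclass
class RetryPolicySketch:
    # Defaults match the synthesis / full_pipeline row.
    max_attempts: int = 3
    good_enough_score: float = 0.70
    low_quality_threshold: float = 0.50

    def should_retry(self, attempt: int, contract_passed: bool,
                     verifier_score: float, is_error: bool = False) -> tuple[bool, str]:
        combined = ((1.0 if contract_passed else 0.0) + verifier_score) / 2.0
        # ASSUMPTION: gap = distance between the good-enough bar and the combined score.
        gap = self.good_enough_score - combined

        if attempt >= self.max_attempts:
            return False, "Max attempts reached"
        if combined >= self.good_enough_score:
            return False, "Quality sufficient"
        if gap < 0.10 and contract_passed:
            return False, "Marginal gap"
        if gap > 0.40 and attempt >= 2:
            return False, "Severe gap persists"
        if is_error:
            return True, "Hard error"
        if not contract_passed:
            return True, "Contract failed"
        if verifier_score < self.low_quality_threshold:
            return True, "Low quality"
        # ASSUMPTION: no retry when none of the conditions above applies.
        return False, "No retry condition met"


policy = RetryPolicySketch()
print(policy.should_retry(attempt=1, contract_passed=False, verifier_score=0.6))
print(policy.should_retry(attempt=2, contract_passed=False, verifier_score=0.1))
```

Note the ordering: the stop conditions are checked before the retry conditions, so a hard error on the final attempt still ends the loop.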
Healing prompt
FailureAnalyzer.build_healing_prompt preserves the original query and appends a structured correction block:
<original_query>
[SELF-CORRECTION: Attempt 2 of 3]
Your previous response had quality issues that must be corrected:
MISSING REQUIREMENT: Insufficient insights: 1 created (minimum 3)
MISSING REQUIREMENT: Citations required but none found
QUALITY ISSUE: Claims about market size lack source attribution
Reviewer feedback: Good analysis but synthesis requires direct quotes...
Produce a complete response that fully addresses ALL items above.
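A prompt of that shape could be assembled roughly as follows. The function signature is hypothetical, and the mapping of contract issues to MISSING REQUIREMENT lines and verifier issues to QUALITY ISSUE lines is inferred from the example above, not confirmed by the source.

```python
def build_healing_prompt(original_query: str, attempt: int, max_attempts: int,
                         contract_issues: list[str], verifier_issues: list[str],
                         feedback: str = "") -> str:
    """Append a structured correction block to the unchanged original query."""
    lines = [
        original_query,
        f"[SELF-CORRECTION: Attempt {attempt} of {max_attempts}]",
        "Your previous response had quality issues that must be corrected:",
    ]
    # ASSUMPTION: contract failures and verifier findings get different prefixes.
    lines += [f"MISSING REQUIREMENT: {issue}" for issue in contract_issues]
    lines += [f"QUALITY ISSUE: {issue}" for issue in verifier_issues]
    if feedback:
        lines.append(f"Reviewer feedback: {feedback}")
    lines.append("Produce a complete response that fully addresses ALL items above.")
    return "\n".join(lines)


prompt = build_healing_prompt(
    "Summarize the interviews",
    attempt=2,
    max_attempts=3,
    contract_issues=["Insufficient insights: 1 created (minimum 3)"],
    verifier_issues=["Claims about market size lack source attribution"],
)
print(prompt)
```

Keeping the original query verbatim at the top means the retry answers the same question, with the correction block narrowing what must change.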
If all attempts are exhausted without passing, a harness_degraded event is emitted with attempts_made, best_score, and aggregated issues.
Source: app/core/harness/healing.py.
Iteration Complete
For deep-synthesis iteration runs (multi-pass refinement), the iteration engine emits a single harness_iteration_complete event summarizing the best result across all iterations: number of iterations, best score, accepted count, total duration, and estimated cost. The best iteration's text becomes the final response.
Workspace Profile
WorkspaceProfile is built from harness_telemetry + response_feedback tables and cached for 10 minutes per workspace.
Maturity thresholds
| Maturity | total_runs | quality_multiplier |
|---|---|---|
| new | 0-9 | 0.85 (lower bar; be encouraging) |
| developing | 10-49 | 1.00 (standard) |
| mature | 50+ | 1.15 if avg_verifier_score > 0.75, else 1.00 |
The multiplier scales min_insights and min_output_chars in the contract. Insights are floored at 1, chars at 50.
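The maturity table and the scaling rule can be sketched together. The function names are hypothetical, and rounding to the nearest integer is an assumption: the source states the floors (1 insight, 50 chars) but not how fractional thresholds are rounded.

```python
def maturity_for(total_runs: int) -> str:
    """Map a run count to the maturity level from the table."""
    if total_runs < 10:
        return "new"
    if total_runs < 50:
        return "developing"
    return "mature"


def quality_multiplier(total_runs: int, avg_verifier_score: float) -> float:
    level = maturity_for(total_runs)
    if level == "new":
        return 0.85
    if level == "mature" and avg_verifier_score > 0.75:
        return 1.15
    return 1.00


def scale_thresholds(min_insights: int, min_output_chars: int,
                     multiplier: float) -> tuple[int, int]:
    # ASSUMPTION: round to nearest integer before applying the floors.
    return (max(1, round(min_insights * multiplier)),
            max(50, round(min_output_chars * multiplier)))


print(scale_thresholds(4, 100, quality_multiplier(5, 0.9)))   # new workspace
print(scale_thresholds(4, 100, quality_multiplier(60, 0.8)))  # mature, high quality
```

The floors matter for small contracts: a chat contract with min_output_chars of 50 stays at 50 even under the 0.85 multiplier.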
When a workspace crosses a threshold (9→10 or 49→50) on the current run, a milestone_unlocked event is emitted alongside the harness_workspace_profile event.
Source: app/core/harness/workspace_profile.py.
Events Reference
All harness events are emitted as standard SSE (event: <type>\ndata: <json>\n\n). Full payload shapes live in the Streaming (SSE) API reference.
| Event | When | Key fields |
|---|---|---|
| harness_complexity | Pre-flight | level, reason, subtask_count |
| harness_plan_selected | Pre-flight (COMPLEX only) | plan, score, candidates_count |
| harness_stream_alert | Mid-stream at 500 / 2000 / 5000 chars | severity, checkpoint, issue, suggestion |
| harness_swarm_activated | Swarm start | source_count, specialist_count, specialists |
| harness_swarm_complete | Swarm end | specialists_succeeded, aggregator_succeeded, total_duration_ms, estimated_cost_usd |
| harness_subtask_start | Subtask DAG layer begins | subtask_index, total_subtasks, title, workflow_type, parallel |
| harness_subtask_done | Subtask DAG layer ends | subtask_index, total_subtasks, artifacts_created, is_last |
| harness_contract | Post-flight | passed, score, issues, artifact_counts, reasoning |
| harness_verifier | Post-flight | passed, score, issues, feedback, reasoning, contrastive |
| harness_rqs | Post-flight (synthesis) | score, grade, label, breakdown |
| harness_rec_qs | Post-flight (recommendation) | score, grade, label, breakdown |
| harness_doc_qs | Post-flight (drafting) | score, grade, label, breakdown |
| harness_workspace_profile | Post-flight | maturity, total_runs, avg_quality, feedback_rate, common_issues, quality_multiplier |
| harness_retry | Between attempts | attempt, reason, previous_score |
| harness_degraded | All retries exhausted | attempts_made, best_score, workflow_type, issues |
| harness_iteration_complete | Deep-synthesis iteration end | iterations, best_score, accepted_count, total_duration_ms, estimated_cost_usd |
| context_metrics | Just before done | total_tool_calls, cumulative_result_chars, estimated_tool_result_tokens, largest_result, calls_per_tool |
| milestone_unlocked | Workspace crosses maturity threshold | level, threshold, total_runs, description, quality_multiplier |
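For clients consuming these events, the SSE framing described at the start of this section can be sketched as a formatter and a simplified parser. Both function names are hypothetical and the payload values are made up; the parser assumes one `event:` line and one single-line `data:` line per frame, which is narrower than the full SSE format.

```python
import json


def format_sse(event_type: str, payload: dict) -> str:
    """Frame one event: an event line, a data line, then a blank line."""
    return f"event: {event_type}\ndata: {json.dumps(payload)}\n\n"


def parse_sse(stream_text: str) -> list[tuple[str, dict]]:
    """Split buffered stream text into (event_type, payload) pairs."""
    events = []
    for frame in stream_text.split("\n\n"):
        if not frame.strip():
            continue
        # Split each line on the first ": " only, so JSON values stay intact.
        fields = dict(line.split(": ", 1) for line in frame.split("\n"))
        events.append((fields["event"], json.loads(fields["data"])))
    return events


wire = (
    format_sse("harness_retry", {"attempt": 2, "reason": "Contract failed", "previous_score": 0.3})
    + format_sse("harness_contract", {"passed": True, "score": 1.0, "issues": []})
)
for event_type, payload in parse_sse(wire):
    print(event_type, payload)
```

A production client would more likely use an existing SSE library or the browser's EventSource, which also handle multi-line data fields, comments, and reconnection.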
Tuning
Tune retry thresholds via settings
Edit app/core/config.py (or set env vars):
retry_good_enough_score: float = 0.70 # combined threshold above which no retry fires
retry_low_quality_threshold: float = 0.50 # verifier score below which quality retry triggers
retry_max_attempts_synthesis: int = 3 # max attempts for synthesis/full_pipeline
retry_max_attempts_chat: int = 2 # max attempts for chat
retry_max_attempts_default: int = 3 # max attempts for recommendation/drafting/other
Swap verifier prompts
Verifier system prompts are loaded via PromptRegistry.get(prompt_id, tier=tier) with hardcoded fallbacks. Override these IDs in the registry to change verifier behavior without a deploy:
- verifier_system: standard verifier
- verifier_accuracy, verifier_completeness, verifier_usefulness: contrastive perspectives
Adjust contract thresholds per workflow
Edit WorkflowContract._static_for_workflow in app/core/harness/contracts.py. All base contracts are declared in a single contracts dict — changes apply to every future run.
Disable swarm per user
Swarm activation is gated by the user's plan entitlement has_synthesis_swarm. Downgrade or upgrade the plan to toggle. There is no per-request disable flag.
The harness never raises to the user. Every component (ComplexityClassifier, VerifierAgent, PlanGenerator, workspace profile loading) has a fallback path that logs and returns a safe default. If Haiku is down, the run still completes — it just ships with score=0.0, feedback="[is_fallback]" so downstream retry logic knows verification was skipped.