Agent Harness

The quality system wrapping every agent run with pre-flight, mid-stream, and post-flight checks: plan selection, contract, verifier, swarm, retries, and mid-stream alerts.

Overview

Every agent run in Praxiom AI is wrapped by the Agent Harness — a thin coordinator that runs pre-flight, mid-flight, and post-flight quality checks around _generate_sse_stream. The harness does not own the stream or replace the primary agent. It surrounds it.

The harness exists to:

Catch hallucinations and refusals early via heuristic mid-stream alerts and an independent Haiku verifier with fresh context.
Enforce per-workflow contracts (e.g. "synthesis must produce N cited insights") so output shape is checked algorithmically, not just by an LLM.
Give users a quality signal — contract pass/fail, verifier score, workspace maturity, RQS/RecQS/DocQS — instead of an opaque blob of text.
Self-heal — retry with a surgical correction prompt when contract or verifier fails, up to a per-workflow attempt cap.
Calibrate over time — each workspace builds a profile from telemetry + feedback, which in turn tunes contract thresholds and verifier scoring.

The harness is always on; there is no per-request toggle. For a typical run, overhead is 300-500 ms (one Haiku classification plus one Haiku verifier call, about $0.002 total).

Lifecycle

┌──────────────── PRE-FLIGHT (parallel with intent inference) ───────────────┐
│  ComplexityClassifier    → SIMPLE | MEDIUM | COMPLEX + ExpectedOutput       │
│  PlanGenerator (COMPLEX) → 2-3 candidate plans                              │
│  PlanSelector            → best plan (deterministic scoring)                │
│  get_quality_preamble    → injects quality requirements into system prompt  │
└────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌──────────────── STREAM (_generate_sse_stream) ─────────────────────────────┐
│  Primary agent runs — tokens, tool_use, tool_result, creation events        │
│  StreamMonitor checks buffer at 500 / 2000 / 5000 chars                    │
│    └── harness_stream_alert (warning | critical)                            │
│  Optional Swarm mode for high-source-count synthesis                        │
│    └── harness_swarm_activated / harness_swarm_complete                     │
└────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌──────────────── POST-FLIGHT (retry loop) ──────────────────────────────────┐
│  ContractEvaluator   → pass/fail, score, issues                             │
│  VerifierAgent       → independent Haiku judge (fresh context)              │
│    └── verify_contrastive for COMPLEX synthesis (accuracy+completeness+use) │
│  RetryPolicy         → decide whether to retry with a healing prompt        │
│    └── harness_retry if should_retry is True (up to max_attempts)           │
│  harness_contract / harness_verifier / harness_rqs / harness_rec_qs / ...   │
│  WorkspaceProfile    → adjust thresholds, emit harness_workspace_profile    │
│  Background: telemetry emission + entropy scan (fresh DB session)           │
└────────────────────────────────────────────────────────────────────────────┘

Source: app/core/harness/orchestrator.py, app/api/routes/stream.py.

Plan Selection

For queries classified as COMPLEX, the harness generates 2-3 candidate execution plans using Haiku, then selects one using a deterministic scorer (no second LLM call for selection). The selected plan's prompt_modifier and subtasks feed into the primary agent run.

Scoring dimensions

Each candidate ExecutionPlan is scored on four dimensions:

Dimension	Weight	What it measures
`coverage`	0.35	Overlap between query keywords and plan description + prompt modifier
`feasibility`	0.25	Whether workspace has the data the plan needs (sources for synthesis, insights for recommendations)
`efficiency`	0.20	Fewer estimated turns = higher score (turn count clamped to `>= 1`)
`alignment`	0.20	Does `suggested_workflow` match the intent classifier's detection?

Composite total = 0.35*coverage + 0.25*feasibility + 0.20*efficiency + 0.20*alignment.
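As a rough illustration of the weighted sum, here is a minimal sketch. `PlanScores`, `composite`, and `select_plan` are hypothetical names, not the identifiers in `planning.py`, and the tie-breaking rule (earliest candidate wins) is an assumption.

```python
from dataclasses import dataclass

# Weights copied from the scoring table.
WEIGHTS = {"coverage": 0.35, "feasibility": 0.25, "efficiency": 0.20, "alignment": 0.20}


@dataclass
class PlanScores:
    """Hypothetical container for one candidate's per-dimension scores (each 0-1)."""
    name: str
    coverage: float
    feasibility: float
    efficiency: float
    alignment: float


def composite(scores: PlanScores) -> float:
    """Weighted sum of the four dimensions."""
    return sum(weight * getattr(scores, dim) for dim, weight in WEIGHTS.items())


def select_plan(candidates: list[PlanScores]) -> PlanScores:
    """Pick the highest composite; on a tie max() keeps the earliest (assumed rule)."""
    return max(candidates, key=composite)


direct = PlanScores("direct", coverage=0.6, feasibility=1.0, efficiency=1.0, alignment=0.0)
staged = PlanScores("staged", coverage=0.8, feasibility=0.5, efficiency=0.5, alignment=1.0)
best = select_plan([direct, staged])  # staged: 0.705 vs direct: 0.66
```

Because selection is a pure function of the four scores, the same candidates always produce the same choice, which is what makes the selector deterministic.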

When it runs

Only for COMPLEX queries (ComplexityResult.is_complex == True).
Runs after complexity classification, before the primary agent starts.
Never raises — falls back to a single default "Execute {workflow_type} workflow directly" plan on any error.

Source: app/core/harness/planning.py.

Contract Evaluation

WorkflowContract defines what "done" means per workflow type. ContractEvaluator checks actual agent output against the contract and returns a ContractResult with passed, score (fraction of checks passed), and issues.

Base contracts

Workflow	min_insights	min_recommendations	min_documents	require_citations	require_artifacts	min_output_chars
`synthesis`	`max(1, min(source_count, 10))`	0	0	yes	yes	100
`recommendation`	0	1	0	no	yes	100
`drafting`	0	0	1	no	yes	200
`full_pipeline`	`max(1, min(source_count, 10))`	1	1	yes	yes	200
`chat`	0	0	0	no	no	50

Adaptive overrides

Two layers can tighten or loosen the base contract:

ExpectedOutput from the complexity classifier — per-query overrides (e.g. classifier says "this query needs 5 insights with citations"). Applied in WorkflowContract.for_workflow(..., expected=...).
WorkspaceProfile.quality_multiplier — mature workspaces with high historical quality get thresholds scaled by 1.15; new workspaces get 0.85.

Example issues

"Output too short: 42 chars (minimum 100)"
"Insufficient insights: 1 created (minimum 3)"
"Citations required but none found — insights must include direct quotes from sources"
"No artifacts created — at least one insight, recommendation, or document is required"
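The check-and-score pattern can be sketched roughly as follows. This is an illustration only: `evaluate_synthesis` and `min_insights_for` are hypothetical helpers, only three of the synthesis checks are modeled, and the real `ContractEvaluator` may structure its checks differently.

```python
from dataclasses import dataclass, field


@dataclass
class ContractResult:
    passed: bool
    score: float  # fraction of checks passed
    issues: list[str] = field(default_factory=list)


def min_insights_for(source_count: int) -> int:
    """Base synthesis threshold from the table: max(1, min(source_count, 10))."""
    return max(1, min(source_count, 10))


def evaluate_synthesis(output: str, insights: int, has_citations: bool,
                       source_count: int, min_output_chars: int = 100) -> ContractResult:
    """Hypothetical evaluator covering length, insight count, and citations."""
    required = min_insights_for(source_count)
    checks = [
        (len(output) >= min_output_chars,
         f"Output too short: {len(output)} chars (minimum {min_output_chars})"),
        (insights >= required,
         f"Insufficient insights: {insights} created (minimum {required})"),
        (has_citations, "Citations required but none found"),
    ]
    issues = [message for ok, message in checks if not ok]
    score = (len(checks) - len(issues)) / len(checks)
    return ContractResult(passed=not issues, score=score, issues=issues)


failing = evaluate_synthesis("x" * 42, insights=1, has_citations=False, source_count=3)
passing = evaluate_synthesis("x" * 150, insights=3, has_citations=True, source_count=3)
```

Every check is plain arithmetic or a boolean, so the result does not depend on an LLM call.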

Source: app/core/harness/contracts.py.

Verifier

VerifierAgent is an independent Haiku judge spawned with fresh context after the primary agent completes. It receives:

The original user query (first 500 chars)
The agent's text output (first 8000 chars)
A summary of artifacts created ("3 insights, 1 document")

It returns JSON with passed, score (0-1), issues, feedback.

Key design: the verifier has no access to workspace data — it judges purely on text quality. This prevents contamination by the same context that may have led the primary agent astray.

Model: claude-haiku-4-5-20251001
Cost: ~$0.001 per verification
Latency: ~200ms
Fallback: on API error, returns passed=False, score=0.0, feedback="[is_fallback] ..." so retry gates still trigger.
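A minimal sketch of the input truncation and the failing fallback is shown here. The prompt layout, `VerifierResult`, and both function names are assumptions for illustration. Treating a malformed reply the same way as an API error is also an assumption, since the source only documents the API-error case.

```python
import json
from dataclasses import dataclass, field


@dataclass
class VerifierResult:
    passed: bool
    score: float
    issues: list[str] = field(default_factory=list)
    feedback: str = ""


def build_verifier_input(query: str, output: str, artifact_summary: str) -> str:
    """Truncate to the documented limits (500 / 8000 chars); the layout is assumed."""
    return (
        f"QUERY:\n{query[:500]}\n\n"
        f"OUTPUT:\n{output[:8000]}\n\n"
        f"ARTIFACTS: {artifact_summary}"
    )


def parse_verifier_reply(raw: str) -> VerifierResult:
    """Parse the judge's JSON; on failure return a failing result so retry gates trigger."""
    try:
        data = json.loads(raw)
        return VerifierResult(
            passed=bool(data["passed"]),
            score=min(1.0, max(0.0, float(data["score"]))),  # keep score in 0-1
            issues=list(data.get("issues", [])),
            feedback=str(data.get("feedback", "")),
        )
    except (ValueError, KeyError, TypeError) as exc:
        return VerifierResult(False, 0.0, [], f"[is_fallback] {exc}")


good = parse_verifier_reply('{"passed": true, "score": 0.8, "issues": [], "feedback": "ok"}')
bad = parse_verifier_reply("not json")
```

Returning `passed=False` rather than raising means the caller never needs a separate error path: a skipped verification looks like a low score and is distinguishable by the `[is_fallback]` prefix.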

Calibration from feedback

Every 5 minutes, the verifier rebuilds a per-workspace calibration context from up to 3 recent feedback mismatches — cases where the verifier score disagreed with user thumbs up/down. These are injected into the system prompt as "adjust scoring for similar patterns" examples, so the verifier learns each workspace's quality bar.

Contrastive verification (COMPLEX queries only)

For COMPLEX queries in synthesis, full_pipeline, or recommendation workflows, verify_contrastive runs three focused verifiers in parallel:

Accuracy — are claims supported by evidence?
Completeness — is the query fully answered?
Usefulness — is the output actionable?

Final score = minimum across all three perspectives. All issues are tagged with their perspective, e.g. "[accuracy] Claim about market size has no citation".
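The parallel run and min-aggregation might look like the sketch below. The `judge` callable and its `(perspective, text) -> (score, issues)` signature are hypothetical stand-ins for the three Haiku calls; only the minimum rule and the issue tagging come from the description above.

```python
import asyncio
from typing import Awaitable, Callable

PERSPECTIVES = ("accuracy", "completeness", "usefulness")

# Hypothetical judge signature: (perspective, text) -> (score, issues)
Judge = Callable[[str, str], Awaitable[tuple[float, list[str]]]]


async def verify_contrastive_sketch(judge: Judge, text: str) -> dict:
    # Run the three perspective judges concurrently.
    replies = await asyncio.gather(*(judge(p, text) for p in PERSPECTIVES))
    # The weakest perspective sets the final score.
    score = min(s for s, _ in replies)
    # Prefix every issue with the perspective that raised it.
    issues = [
        f"[{p}] {issue}"
        for p, (_, found) in zip(PERSPECTIVES, replies)
        for issue in found
    ]
    return {"score": score, "issues": issues}


async def fake_judge(perspective: str, text: str) -> tuple[float, list[str]]:
    """Canned replies standing in for real model calls."""
    canned = {
        "accuracy": (0.4, ["Claim about market size has no citation"]),
        "completeness": (0.9, []),
        "usefulness": (0.8, []),
    }
    return canned[perspective]


result = asyncio.run(verify_contrastive_sketch(fake_judge, "draft text"))
```

Taking the minimum rather than the mean means a strong usefulness score cannot mask an unsupported claim.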

Source: app/core/harness/verifier.py.

Swarm Mode

For synthesis workflows with many sources, the harness can activate a specialist swarm — multiple sub-agents running in parallel, each analyzing sources from a different angle, with an aggregator agent merging their outputs.

Activation criteria

User's plan has has_synthesis_swarm entitlement.
Source count typically >= 5 (treated as COMPLEX).
should_activate() returns True for the workflow type.

Events

harness_swarm_activated — emitted on start with specialist count and names.
tool_sandbox (tool_name synthesis_swarm) — progress updates from each specialist.
harness_swarm_complete — emitted on finish with per-specialist success counts, total duration, and estimated cost.

After the swarm finishes, the aggregator output becomes the prompt for the main agent pass, which persists the final insights via save_insights.
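The fan-out and merge can be sketched with `asyncio.gather`. Everything here is illustrative: `run_swarm_sketch`, the specialist callables, and the returned dictionary are assumptions, and only the field names `specialists_succeeded` and `total_duration_ms` echo the documented event payload.

```python
import asyncio
import time
from typing import Awaitable, Callable

# Hypothetical specialist signature: takes the sources, returns its findings as text.
Specialist = Callable[[list[str]], Awaitable[str]]


async def run_swarm_sketch(specialists: dict[str, Specialist], sources: list[str],
                           aggregate: Callable[[list[str]], Awaitable[str]]) -> dict:
    started = time.monotonic()
    # return_exceptions=True keeps one failing specialist from aborting the others.
    outcomes = await asyncio.gather(
        *(run(sources) for run in specialists.values()), return_exceptions=True
    )
    findings = [o for o in outcomes if not isinstance(o, BaseException)]
    merged = await aggregate(findings)  # aggregator sees only successful outputs
    return {
        "specialist_count": len(specialists),
        "specialists_succeeded": len(findings),
        "aggregated": merged,
        "total_duration_ms": int((time.monotonic() - started) * 1000),
    }


async def themes(sources: list[str]) -> str:
    return f"themes across {len(sources)} sources"


async def risks(sources: list[str]) -> str:
    raise RuntimeError("specialist failed")


async def join_findings(findings: list[str]) -> str:
    return " | ".join(findings)


summary = asyncio.run(
    run_swarm_sketch({"themes": themes, "risks": risks}, ["a", "b", "c", "d", "e"], join_findings)
)
```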

Source: app/api/routes/stream.py (search for swarm_active).

Mid-Stream Alerts

StreamMonitor runs lightweight regex-based checks on the accumulated output buffer at three character thresholds. No LLM calls. Designed to catch obvious failures early.

Checkpoints

Checkpoint	Check	Triggers
500 chars	Refusal / filler detection	`critical` if text matches `"I can't"`, `"I cannot"`, `"I don't have access"`, `"Unfortunately"`, etc. `warning` for synthesis/recommendation/full_pipeline if it opens with `"Sure,"`, `"Of course,"`, `"That's a great question"`.
2000 chars	Structural formatting	`warning` for synthesis/recommendation/full_pipeline/drafting if no headers (`#`), bullets (`-` or `*`), or numbered items are present.
5000 chars	Citations / evidence	`warning` for synthesis/full_pipeline if no citation markers (`"according to"`, `"[1]"`, quoted text, etc.) are detected.

Each threshold fires at most once per run. All alerts are emitted as harness_stream_alert SSE events with severity, checkpoint, issue, and suggestion.
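A simplified sketch of the checkpoint mechanism follows. The regular expressions are approximations of the patterns listed in the table, `StreamMonitorSketch` is a hypothetical class, and checking only the first 500 characters for refusals is an assumption.

```python
import re
from typing import Optional

REFUSAL = re.compile(r"I can't|I cannot|I don't have access|Unfortunately")
FILLER = re.compile(r"\s*(Sure,|Of course,|That's a great question)")
STRUCTURE = re.compile(r"^\s*(#|[-*]\s|\d+\.\s)", re.MULTILINE)
CITATION = re.compile(r'according to|\[\d+\]|"[^"\n]+"', re.IGNORECASE)

ARTIFACT_WORKFLOWS = {"synthesis", "recommendation", "full_pipeline"}


class StreamMonitorSketch:
    CHECKPOINTS = (500, 2000, 5000)

    def __init__(self, workflow: str) -> None:
        self.workflow = workflow
        self.buffer = ""
        self.fired: set[int] = set()

    def feed(self, chunk: str) -> list[dict]:
        """Append a chunk and run any checkpoint crossed for the first time."""
        self.buffer += chunk
        alerts = []
        for checkpoint in self.CHECKPOINTS:
            if len(self.buffer) >= checkpoint and checkpoint not in self.fired:
                self.fired.add(checkpoint)  # each threshold fires at most once
                alert = self._check(checkpoint)
                if alert:
                    alerts.append({"checkpoint": checkpoint, **alert})
        return alerts

    def _check(self, checkpoint: int) -> Optional[dict]:
        text, wf = self.buffer, self.workflow
        if checkpoint == 500:
            if REFUSAL.search(text[:500]):
                return {"severity": "critical", "issue": "possible refusal"}
            if wf in ARTIFACT_WORKFLOWS and FILLER.match(text):
                return {"severity": "warning", "issue": "filler opening"}
        elif checkpoint == 2000:
            if wf in ARTIFACT_WORKFLOWS | {"drafting"} and not STRUCTURE.search(text):
                return {"severity": "warning", "issue": "no structural formatting"}
        elif checkpoint == 5000:
            if wf in {"synthesis", "full_pipeline"} and not CITATION.search(text):
                return {"severity": "warning", "issue": "no citation markers"}
        return None


monitor = StreamMonitorSketch("synthesis")
first = monitor.feed("I cannot help with that. " + "x" * 500)
second = monitor.feed("more")
third = monitor.feed("y" * 2000)
```

Here `first` carries a critical alert for the 500-char checkpoint, `second` is empty because that checkpoint has already fired, and `third` carries the 2000-char formatting warning.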

Source: app/core/harness/stream_monitor.py.

Retry & Self-Healing

When an attempt fails the contract or scores too low on the verifier, RetryPolicy decides whether to retry with a surgical healing prompt built by FailureAnalyzer.

Policy per workflow

Workflow	max_attempts	good_enough_score	low_quality_threshold
`synthesis`, `full_pipeline`	`settings.retry_max_attempts_synthesis` (default 3)	0.70	0.50
`chat`	`settings.retry_max_attempts_chat` (default 2)	0.60	0.40
`recommendation`, `drafting`, other	`settings.retry_max_attempts_default` (default 3)	0.65	0.50

Decision logic (`should_retry`)

combined = ((1.0 if contract_passed else 0.0) + verifier_score) / 2.0
gap = good_enough_score - combined   # assumed definition: shortfall below the target score

if attempt >= max_attempts:             return False, "Max attempts reached"
if combined >= good_enough_score:       return False, "Quality sufficient"
if gap < 0.10 and contract_passed:      return False, "Marginal gap — retry unlikely to help"
if gap > 0.40 and attempt >= 2:         return False, "Severe gap persists — source material may be insufficient"
if is_error:                             return True,  "Hard error — retrying"
if not contract_passed:                  return True,  "Contract failed — retrying with healing prompt"
if verifier_score < low_quality_threshold: return True, "Low quality — retrying"
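The rules above can be written as a runnable function. Two things in this sketch are assumptions rather than documented behavior: `gap` is taken to be `good_enough_score - combined`, and the final fall-through returns no retry. The reason strings are shortened, and the function name is hypothetical.

```python
def should_retry_sketch(attempt: int, max_attempts: int, contract_passed: bool,
                        verifier_score: float, good_enough_score: float = 0.70,
                        low_quality_threshold: float = 0.50,
                        is_error: bool = False) -> tuple[bool, str]:
    combined = ((1.0 if contract_passed else 0.0) + verifier_score) / 2.0
    gap = good_enough_score - combined  # ASSUMPTION: shortfall below the target

    # Checks run in the documented order; the first match wins.
    if attempt >= max_attempts:
        return False, "Max attempts reached"
    if combined >= good_enough_score:
        return False, "Quality sufficient"
    if gap < 0.10 and contract_passed:
        return False, "Marginal gap"
    if gap > 0.40 and attempt >= 2:
        return False, "Severe gap persists"
    if is_error:
        return True, "Hard error"
    if not contract_passed:
        return True, "Contract failed"
    if verifier_score < low_quality_threshold:
        return True, "Low quality"
    return False, "No retry condition met"  # ASSUMPTION: default when nothing matches


# Synthesis defaults: first attempt, contract failed, verifier gave 0.6.
decision = should_retry_sketch(attempt=1, max_attempts=3,
                               contract_passed=False, verifier_score=0.6)
```

In this example `combined` is 0.30, so the run is neither good enough nor a marginal miss, and the failed contract triggers a retry.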

Healing prompt

FailureAnalyzer.build_healing_prompt preserves the original query and appends a structured correction block:

<original_query>

[SELF-CORRECTION: Attempt 2 of 3]
Your previous response had quality issues that must be corrected:

  MISSING REQUIREMENT: Insufficient insights: 1 created (minimum 3)
  MISSING REQUIREMENT: Citations required but none found
  QUALITY ISSUE: Claims about market size lack source attribution

Reviewer feedback: Good analysis but synthesis requires direct quotes...

Produce a complete response that fully addresses ALL items above.
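A correction block of this shape could be assembled as in the sketch below. The function name and parameters are hypothetical, and the mapping of contract issues to `MISSING REQUIREMENT` and verifier issues to `QUALITY ISSUE` is inferred from the example, not confirmed by the source.

```python
def build_healing_prompt_sketch(query: str, attempt: int, max_attempts: int,
                                contract_issues: list[str], verifier_issues: list[str],
                                feedback: str = "") -> str:
    """Keep the original query first, then append the structured correction block."""
    lines = [
        query,
        "",
        f"[SELF-CORRECTION: Attempt {attempt} of {max_attempts}]",
        "Your previous response had quality issues that must be corrected:",
        "",
    ]
    # Assumed mapping: contract failures are requirements, verifier findings are quality issues.
    lines += [f"  MISSING REQUIREMENT: {issue}" for issue in contract_issues]
    lines += [f"  QUALITY ISSUE: {issue}" for issue in verifier_issues]
    if feedback:
        lines += ["", f"Reviewer feedback: {feedback}"]
    lines += ["", "Produce a complete response that fully addresses ALL items above."]
    return "\n".join(lines)


prompt = build_healing_prompt_sketch(
    "Summarize the interview findings",
    attempt=2,
    max_attempts=3,
    contract_issues=["Insufficient insights: 1 created (minimum 3)"],
    verifier_issues=["Claims about market size lack source attribution"],
    feedback="Good analysis but synthesis requires direct quotes",
)
```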

If all attempts are exhausted without passing, a harness_degraded event is emitted with attempts_made, best_score, and aggregated issues.

Source: app/core/harness/healing.py.

Iteration Complete

For deep-synthesis iteration runs (multi-pass refinement), the iteration engine emits a single harness_iteration_complete event summarizing the best result across all iterations: number of iterations, best score, accepted count, total duration, and estimated cost. The best iteration's text becomes the final response.

Workspace Profile

WorkspaceProfile is built from harness_telemetry + response_feedback tables and cached for 10 minutes per workspace.

Maturity thresholds

Maturity	`total_runs`	`quality_multiplier`
`new`	0-9	0.85 (lower bar — be encouraging)
`developing`	10-49	1.00 (standard)
`mature`	50+	1.15 if `avg_verifier_score > 0.75`, else 1.00

The multiplier scales min_insights and min_output_chars in the contract. Insights are floored at 1, chars at 50.
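The maturity table and the floors translate into a few lines of arithmetic. The helper names are hypothetical, and the use of `round()` is an assumption, since the source does not state how fractional thresholds are rounded.

```python
def maturity_for(total_runs: int) -> str:
    """Maturity band from the table: 0-9 new, 10-49 developing, 50+ mature."""
    if total_runs < 10:
        return "new"
    if total_runs < 50:
        return "developing"
    return "mature"


def quality_multiplier(total_runs: int, avg_verifier_score: float) -> float:
    maturity = maturity_for(total_runs)
    if maturity == "new":
        return 0.85
    if maturity == "mature" and avg_verifier_score > 0.75:
        return 1.15
    return 1.00


def scale_contract(min_insights: int, min_output_chars: int,
                   multiplier: float) -> tuple[int, int]:
    """Scale both thresholds, then apply the floors (insights >= 1, chars >= 50)."""
    # ASSUMPTION: round() is used here; the real rounding mode is not documented.
    insights = max(1, round(min_insights * multiplier))
    chars = max(50, round(min_output_chars * multiplier))
    return insights, chars


new_workspace = scale_contract(3, 100, quality_multiplier(4, 0.0))
mature_workspace = scale_contract(5, 100, quality_multiplier(80, 0.82))
```

The floors matter mostly for `chat`, where scaling `min_output_chars` of 50 by 0.85 would otherwise drop below the minimum.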

When a workspace crosses a threshold (9→10 or 49→50) on the current run, a milestone_unlocked event is emitted alongside the harness_workspace_profile event.

Source: app/core/harness/workspace_profile.py.

Events Reference

All harness events are emitted as standard SSE (event: <type>\ndata: <json>\n\n). Full payload shapes live in the Streaming (SSE) API reference.

Event	When	Key fields
`harness_complexity`	Pre-flight	`level`, `reason`, `subtask_count`
`harness_plan_selected`	Pre-flight (COMPLEX only)	`plan`, `score`, `candidates_count`
`harness_stream_alert`	Mid-stream at 500 / 2000 / 5000 chars	`severity`, `checkpoint`, `issue`, `suggestion`
`harness_swarm_activated`	Swarm start	`source_count`, `specialist_count`, `specialists`
`harness_swarm_complete`	Swarm end	`specialists_succeeded`, `aggregator_succeeded`, `total_duration_ms`, `estimated_cost_usd`
`harness_subtask_start`	Subtask DAG layer begins	`subtask_index`, `total_subtasks`, `title`, `workflow_type`, `parallel`
`harness_subtask_done`	Subtask DAG layer ends	`subtask_index`, `total_subtasks`, `artifacts_created`, `is_last`
`harness_contract`	Post-flight	`passed`, `score`, `issues`, `artifact_counts`, `reasoning`
`harness_verifier`	Post-flight	`passed`, `score`, `issues`, `feedback`, `reasoning`, `contrastive`
`harness_rqs`	Post-flight (synthesis)	`score`, `grade`, `label`, `breakdown`
`harness_rec_qs`	Post-flight (recommendation)	`score`, `grade`, `label`, `breakdown`
`harness_doc_qs`	Post-flight (drafting)	`score`, `grade`, `label`, `breakdown`
`harness_workspace_profile`	Post-flight	`maturity`, `total_runs`, `avg_quality`, `feedback_rate`, `common_issues`, `quality_multiplier`
`harness_retry`	Between attempts	`attempt`, `reason`, `previous_score`
`harness_degraded`	All retries exhausted	`attempts_made`, `best_score`, `workflow_type`, `issues`
`harness_iteration_complete`	Deep-synthesis iteration end	`iterations`, `best_score`, `accepted_count`, `total_duration_ms`, `estimated_cost_usd`
`context_metrics`	Just before `done`	`total_tool_calls`, `cumulative_result_chars`, `estimated_tool_result_tokens`, `largest_result`, `calls_per_tool`
`milestone_unlocked`	Workspace crosses maturity threshold	`level`, `threshold`, `total_runs`, `description`, `quality_multiplier`
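For reference, the SSE framing used by all of these events can be produced and consumed with the standard library alone. This is a minimal sketch with hypothetical helper names; the payload shown uses the `harness_retry` fields from the table with made-up values.

```python
import json


def format_sse(event_type: str, payload: dict) -> str:
    """Frame one event as 'event: <type>\\ndata: <json>\\n\\n'."""
    return f"event: {event_type}\ndata: {json.dumps(payload)}\n\n"


def parse_sse(frame: str) -> tuple[str, dict]:
    """Parse a single frame back into (event type, decoded JSON payload)."""
    event_type, data_lines = "message", []
    for line in frame.splitlines():
        field, _, value = line.partition(":")
        value = value[1:] if value.startswith(" ") else value  # drop one leading space
        if field == "event":
            event_type = value
        elif field == "data":
            data_lines.append(value)
    return event_type, json.loads("\n".join(data_lines))


frame = format_sse(
    "harness_retry",
    {"attempt": 2, "reason": "Contract failed", "previous_score": 0.3},
)
event_type, payload = parse_sse(frame)
```

A client would typically dispatch on `event_type` and ignore harness events it does not recognize, so new event types can be added without breaking older consumers.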

Tuning

Tune retry thresholds via settings

Edit app/core/config.py (or set env vars):

retry_good_enough_score: float = 0.70      # combined threshold above which no retry fires
retry_low_quality_threshold: float = 0.50  # verifier score below which quality retry triggers
retry_max_attempts_synthesis: int = 3      # max attempts for synthesis/full_pipeline
retry_max_attempts_chat: int = 2           # max attempts for chat
retry_max_attempts_default: int = 3        # max attempts for recommendation/drafting/other

Swap verifier prompts

Verifier system prompts are loaded via PromptRegistry.get(prompt_id, tier=tier) with hardcoded fallbacks. Override these IDs in the registry to change verifier behavior without a deploy:

verifier_system — standard verifier
verifier_accuracy, verifier_completeness, verifier_usefulness — contrastive perspectives

Adjust contract thresholds per workflow

Edit WorkflowContract._static_for_workflow in app/core/harness/contracts.py. All base contracts are declared in a single contracts dict — changes apply to every future run.

Disable swarm per user

Swarm activation is gated by the user's plan entitlement has_synthesis_swarm. Downgrade or upgrade the plan to toggle. There is no per-request disable flag.

The harness never raises to the user. Every component (ComplexityClassifier, VerifierAgent, PlanGenerator, workspace profile loading) has a fallback path that logs and returns a safe default. If Haiku is down, the run still completes — it just ships with score=0.0, feedback="[is_fallback]" so downstream retry logic knows verification was skipped.
