Agent Harness

The quality system wrapping every agent run: verifier, contract, swarm, plan selection, retries, and mid-stream alerts.

Overview

Every agent run in Praxiom AI is wrapped by the Agent Harness — a thin coordinator that runs pre-flight, mid-flight, and post-flight quality checks around _generate_sse_stream. The harness does not own the stream or replace the primary agent. It surrounds it.

The harness exists to:

  • Catch hallucinations and refusals early via heuristic mid-stream alerts and an independent Haiku verifier with fresh context.
  • Enforce per-workflow contracts (e.g. "synthesis must produce N cited insights") so output shape is checked algorithmically, not just by an LLM.
  • Give users a quality signal — contract pass/fail, verifier score, workspace maturity, RQS/RecQS/DocQS — instead of an opaque blob of text.
  • Self-heal — retry with a surgical correction prompt when contract or verifier fails, up to a per-workflow attempt cap.
  • Calibrate over time — each workspace builds a profile from telemetry + feedback, which in turn tunes contract thresholds and verifier scoring.

The harness is always on; there is no per-request toggle. Typical overhead is 300-500 ms and about $0.002 per run (one Haiku classification call plus one Haiku verifier call).


Lifecycle

┌──────────────── PRE-FLIGHT (parallel with intent inference) ───────────────┐
│  ComplexityClassifier    → SIMPLE | MEDIUM | COMPLEX + ExpectedOutput       │
│  PlanGenerator (COMPLEX) → 2-3 candidate plans                              │
│  PlanSelector            → best plan (deterministic scoring)                │
│  get_quality_preamble    → injects quality requirements into system prompt  │
└────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌──────────────── STREAM (_generate_sse_stream) ─────────────────────────────┐
│  Primary agent runs — tokens, tool_use, tool_result, creation events        │
│  StreamMonitor checks buffer at 500 / 2000 / 5000 chars                    │
│    └── harness_stream_alert (warning | critical)                            │
│  Optional Swarm mode for high-source-count synthesis                        │
│    └── harness_swarm_activated / harness_swarm_complete                     │
└────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌──────────────── POST-FLIGHT (retry loop) ──────────────────────────────────┐
│  ContractEvaluator   → pass/fail, score, issues                             │
│  VerifierAgent       → independent Haiku judge (fresh context)              │
│    └── verify_contrastive for COMPLEX synthesis (accuracy+completeness+use) │
│  RetryPolicy         → decide whether to retry with a healing prompt        │
│    └── harness_retry if should_retry is True (up to max_attempts)           │
│  harness_contract / harness_verifier / harness_rqs / harness_rec_qs / ...   │
│  WorkspaceProfile    → adjust thresholds, emit harness_workspace_profile    │
│  Background: telemetry emission + entropy scan (fresh DB session)           │
└────────────────────────────────────────────────────────────────────────────┘

Source: app/core/harness/orchestrator.py, app/api/routes/stream.py.


Plan Selection

For queries classified as COMPLEX, the harness generates 2-3 candidate execution plans using Haiku, then selects one using a deterministic scorer (no second LLM call for selection). The selected plan's prompt_modifier and subtasks feed into the primary agent run.

Scoring dimensions

Each candidate ExecutionPlan is scored on four dimensions:

| Dimension | Weight | What it measures |
|---|---|---|
| coverage | 0.35 | Overlap between query keywords and plan description + prompt modifier |
| feasibility | 0.25 | Whether workspace has the data the plan needs (sources for synthesis, insights for recommendations) |
| efficiency | 0.20 | Fewer estimated turns = higher score (clamped to >= 1) |
| alignment | 0.20 | Does suggested_workflow match the intent classifier's detection? |

Composite total = 0.35*coverage + 0.25*feasibility + 0.20*efficiency + 0.20*alignment.
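
The weighted sum above can be sketched in a few lines. This is a minimal illustration, not the code in planning.py: the PlanScores container and the composite and select_best helpers are hypothetical names, and each dimension is assumed to already be normalized to the 0-1 range.

```python
from dataclasses import dataclass

# Weights as listed in the scoring table above.
WEIGHTS = {"coverage": 0.35, "feasibility": 0.25, "efficiency": 0.20, "alignment": 0.20}


@dataclass
class PlanScores:
    """Hypothetical container: per-dimension scores, assumed to be in [0, 1]."""
    coverage: float
    feasibility: float
    efficiency: float
    alignment: float


def composite(scores: PlanScores) -> float:
    """Weighted sum of the four dimensions."""
    return sum(weight * getattr(scores, name) for name, weight in WEIGHTS.items())


def select_best(candidates: dict[str, PlanScores]) -> str:
    """Return the name of the highest-scoring candidate plan (no LLM call)."""
    return max(candidates, key=lambda name: composite(candidates[name]))


candidates = {
    "direct": PlanScores(coverage=0.6, feasibility=1.0, efficiency=1.0, alignment=0.0),
    "staged": PlanScores(coverage=0.9, feasibility=1.0, efficiency=0.5, alignment=1.0),
}
best = select_best(candidates)
```

Because the scorer is a pure function of the four inputs, the same candidates always produce the same selection, which is what makes the step deterministic.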

When it runs

  • Only for COMPLEX queries (ComplexityResult.is_complex == True).
  • Runs after complexity classification, before the primary agent starts.
  • Never raises — falls back to a single default "Execute {workflow_type} workflow directly" plan on any error.

Source: app/core/harness/planning.py.


Contract Evaluation

WorkflowContract defines what "done" means per workflow type. ContractEvaluator checks actual agent output against the contract and returns a ContractResult with passed, score (fraction of checks passed), and issues.

Base contracts

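| Workflow | min_insights | min_recommendations | min_documents | require_citations | require_artifacts | min_output_chars |
|---|---|---|---|---|---|---|
| synthesis | max(1, min(source_count, 10)) | 0 | 0 | yes | yes | 100 |
| recommendation | 0 | 1 | 0 | no | yes | 100 |
| drafting | 0 | 0 | 1 | no | yes | 200 |
| full_pipeline | max(1, min(source_count, 10)) | 1 | 1 | yes | yes | 200 |
| chat | 0 | 0 | 0 | no | no | 50 |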

Adaptive overrides

Two layers can tighten or loosen the base contract:

  1. ExpectedOutput from the complexity classifier — per-query overrides (e.g. classifier says "this query needs 5 insights with citations"). Applied in WorkflowContract.for_workflow(..., expected=...).
  2. WorkspaceProfile.quality_multiplier — mature workspaces with high historical quality get thresholds scaled by 1.15; new workspaces get 0.85.

Example issues

"Output too short: 42 chars (minimum 100)"
"Insufficient insights: 1 created (minimum 3)"
"Citations required but none found — insights must include direct quotes from sources"
"No artifacts created — at least one insight, recommendation, or document is required"
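
To make the "score is the fraction of checks passed" idea concrete, here is a small sketch that produces issue strings in the style shown above. It is not the ContractEvaluator from contracts.py: the Contract and ContractResult classes, the evaluate function, and the choice of which checks count toward the score are all assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Contract:
    """Hypothetical subset of a workflow contract."""
    min_insights: int = 0
    require_citations: bool = False
    min_output_chars: int = 0


@dataclass
class ContractResult:
    passed: bool
    score: float  # fraction of checks that passed
    issues: list[str] = field(default_factory=list)


def evaluate(contract: Contract, text: str, insights_created: int, citations_found: int) -> ContractResult:
    # Each check is (ok, issue message used when the check fails).
    checks = [
        (len(text) >= contract.min_output_chars,
         f"Output too short: {len(text)} chars (minimum {contract.min_output_chars})"),
        (insights_created >= contract.min_insights,
         f"Insufficient insights: {insights_created} created (minimum {contract.min_insights})"),
    ]
    if contract.require_citations:
        checks.append((citations_found > 0, "Citations required but none found"))
    issues = [message for ok, message in checks if not ok]
    score = (len(checks) - len(issues)) / len(checks)
    return ContractResult(passed=not issues, score=score, issues=issues)


result = evaluate(Contract(min_insights=3, require_citations=True, min_output_chars=100),
                  text="x" * 42, insights_created=1, citations_found=0)
```

Because every check is a plain comparison, the result is reproducible and does not depend on an LLM's judgment.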

Source: app/core/harness/contracts.py.


Verifier

VerifierAgent is an independent Haiku judge spawned with fresh context after the primary agent completes. It receives:

  • The original user query (first 500 chars)
  • The agent's text output (first 8000 chars)
  • A summary of artifacts created ("3 insights, 1 document")

It returns JSON with passed, score (0-1), issues, feedback.

Key design: the verifier has no access to workspace data — it judges purely on text quality. This prevents contamination by the same context that may have led the primary agent astray.

  • Model: claude-haiku-4-5-20251001
  • Cost: ~$0.001 per verification
  • Latency: ~200ms
  • Fallback: on API error, returns passed=False, score=0.0, feedback="[is_fallback] ..." so retry gates still trigger.
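
The input truncation and the fallback result described above might look roughly like the following. This is a sketch under assumptions: VerifierResult, build_verifier_input, and parse_verifier_response are hypothetical names, and treating an unparseable reply the same way as an API error is a guess rather than documented behavior.

```python
import json
from dataclasses import dataclass, field


@dataclass
class VerifierResult:
    passed: bool
    score: float
    issues: list[str] = field(default_factory=list)
    feedback: str = ""


def build_verifier_input(query: str, output: str, artifact_summary: str) -> str:
    """Truncate inputs to the limits listed above (500 and 8000 chars)."""
    return f"Query: {query[:500]}\n\nOutput: {output[:8000]}\n\nArtifacts: {artifact_summary}"


def parse_verifier_response(raw: str) -> VerifierResult:
    """Parse the verifier's JSON reply; fall back to a failing result on any error."""
    try:
        data = json.loads(raw)
        score = min(1.0, max(0.0, float(data["score"])))  # clamp to 0-1
        return VerifierResult(
            passed=bool(data["passed"]),
            score=score,
            issues=list(data.get("issues", [])),
            feedback=str(data.get("feedback", "")),
        )
    except (ValueError, KeyError, TypeError) as exc:
        # Assumption: a malformed reply is handled like an API error.
        return VerifierResult(False, 0.0, [], f"[is_fallback] verifier unavailable: {exc}")


ok = parse_verifier_response('{"passed": true, "score": 0.82, "issues": [], "feedback": "Solid"}')
bad = parse_verifier_response("not json")
```

Returning passed=False with score=0.0 rather than raising keeps the retry gates working even when verification could not run.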

Calibration from feedback

Every 5 minutes, the verifier rebuilds a per-workspace calibration context from up to 3 recent feedback mismatches — cases where the verifier score disagreed with user thumbs up/down. These are injected into the system prompt as "adjust scoring for similar patterns" examples, so the verifier learns each workspace's quality bar.

Contrastive verification (COMPLEX queries only)

For COMPLEX queries in synthesis, full_pipeline, or recommendation workflows, verify_contrastive runs three focused verifiers in parallel:

  • Accuracy — are claims supported by evidence?
  • Completeness — is the query fully answered?
  • Usefulness — is the output actionable?

Final score = minimum across all three perspectives. All issues are tagged with their perspective, e.g. "[accuracy] Claim about market size has no citation".
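
The fan-out and min-aggregation can be sketched with asyncio. The run_perspective coroutine below is a stand-in that returns canned values instead of calling Haiku, and combine is a hypothetical helper; only the "minimum score, issues tagged by perspective" rule comes from the description above.

```python
import asyncio


async def run_perspective(name: str, score: float, issues: list[str]) -> tuple[str, float, list[str]]:
    """Stand-in for one focused verifier call (no real model call here)."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return name, score, issues


def combine(results: list[tuple[str, float, list[str]]]) -> tuple[float, list[str]]:
    """Final score is the minimum; each issue is prefixed with its perspective."""
    final_score = min(score for _, score, _ in results)
    tagged = [f"[{name}] {issue}" for name, _, issues in results for issue in issues]
    return final_score, tagged


async def verify_contrastive_sketch() -> tuple[float, list[str]]:
    # The three perspectives run concurrently.
    results = await asyncio.gather(
        run_perspective("accuracy", 0.55, ["Claim about market size has no citation"]),
        run_perspective("completeness", 0.80, []),
        run_perspective("usefulness", 0.90, []),
    )
    return combine(list(results))


score, issues = asyncio.run(verify_contrastive_sketch())
```

Taking the minimum means a single weak perspective is enough to pull the overall score down, so a well-written but poorly supported answer cannot pass on style alone.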

Source: app/core/harness/verifier.py.


Swarm Mode

For synthesis workflows with many sources, the harness can activate a specialist swarm — multiple sub-agents running in parallel, each analyzing sources from a different angle, with an aggregator agent merging their outputs.

Activation criteria

  • User's plan has has_synthesis_swarm entitlement.
  • Source count typically >= 5 (treated as COMPLEX).
  • should_activate() returns True for the workflow type.

Events

  • harness_swarm_activated — emitted on start with specialist count and names.
  • tool_sandbox (tool_name synthesis_swarm) — progress updates from each specialist.
  • harness_swarm_complete — emitted on finish with per-specialist success counts, total duration, and estimated cost.

After the swarm finishes, the aggregator output becomes the prompt for the main agent pass, which persists the final insights via save_insights.
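
As a rough picture of the fan-out and aggregation flow, the sketch below runs stub specialists in parallel and merges whatever succeeded. Everything here is hypothetical: the specialist names, the specialist and run_swarm functions, and the plain string join standing in for the aggregator agent are illustrations, not the code in stream.py.

```python
import asyncio
import time


async def specialist(name: str, sources: list[str]) -> str:
    """Stub specialist; a real one would call a model with its own angle."""
    await asyncio.sleep(0)
    if name == "risks":
        raise RuntimeError("specialist failed")  # simulate one failure
    return f"{name}: reviewed {len(sources)} sources"


async def run_swarm(sources: list[str], names: list[str]) -> dict:
    started = time.perf_counter()
    # return_exceptions=True keeps one failing specialist from cancelling the rest.
    results = await asyncio.gather(
        *(specialist(name, sources) for name in names), return_exceptions=True
    )
    outputs = [r for r in results if not isinstance(r, BaseException)]
    return {
        "specialist_count": len(names),
        "specialists_succeeded": len(outputs),
        "total_duration_ms": int((time.perf_counter() - started) * 1000),
        "aggregated": "\n".join(outputs),  # stand-in for the aggregator agent
    }


summary = asyncio.run(run_swarm(["s1", "s2", "s3", "s4", "s5"], ["themes", "risks", "opportunities"]))
```

The returned counts mirror the kind of fields the completion event reports, and the aggregated text is what would be handed to the main agent pass.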

Source: app/api/routes/stream.py (search for swarm_active).


Mid-Stream Alerts

StreamMonitor runs lightweight regex-based checks on the accumulated output buffer at three character thresholds. No LLM calls. Designed to catch obvious failures early.

Checkpoints

| Checkpoint | Check | Triggers |
|---|---|---|
| 500 chars | Refusal / filler detection | critical if text matches "I can't", "I cannot", "I don't have access", "Unfortunately", etc. warning for synthesis/recommendation/full_pipeline if it opens with "Sure,", "Of course,", "That's a great question". |
| 2000 chars | Structural formatting | warning for synthesis/recommendation/full_pipeline/drafting if no headers (#), bullets (- or *), or numbered items are present. |
| 5000 chars | Citations / evidence | warning for synthesis/full_pipeline if no citation markers ("according to", "[1]", quoted text, etc.) are detected. |

Each threshold fires at most once per run. All alerts are emitted as harness_stream_alert SSE events with severity, checkpoint, issue, and suggestion.

Source: app/core/harness/stream_monitor.py.


Retry & Self-Healing

When an attempt fails the contract or scores too low on the verifier, RetryPolicy decides whether to retry with a surgical healing prompt built by FailureAnalyzer.

Policy per workflow

| Workflow | max_attempts | good_enough_score | low_quality_threshold |
|---|---|---|---|
| synthesis, full_pipeline | settings.retry_max_attempts_synthesis (default 3) | 0.70 | 0.50 |
| chat | settings.retry_max_attempts_chat (default 2) | 0.60 | 0.40 |
| recommendation, drafting, other | settings.retry_max_attempts_default (default 3) | 0.65 | 0.50 |

Decision logic (should_retry)

combined = ((1.0 if contract_passed else 0.0) + verifier_score) / 2.0

if attempt >= max_attempts:             return False, "Max attempts reached"
if combined >= good_enough_score:       return False, "Quality sufficient"
if gap < 0.10 and contract_passed:      return False, "Marginal gap — retry unlikely to help"
if gap > 0.40 and attempt >= 2:         return False, "Severe gap persists — source material may be insufficient"
if is_error:                             return True,  "Hard error — retrying"
if not contract_passed:                  return True,  "Contract failed — retrying with healing prompt"
if verifier_score < low_quality_threshold: return True, "Low quality — retrying"
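
The decision logic above leaves gap undefined and has no final return. A runnable reading of it, under two assumptions that are flagged in the comments (gap is taken to be good_enough_score minus combined, and the fall-through case returns False), might be:

```python
def should_retry(
    attempt: int,
    max_attempts: int,
    contract_passed: bool,
    verifier_score: float,
    good_enough_score: float,
    low_quality_threshold: float,
    is_error: bool = False,
) -> tuple[bool, str]:
    combined = ((1.0 if contract_passed else 0.0) + verifier_score) / 2.0
    # ASSUMPTION: "gap" is the distance from the good-enough bar.
    gap = good_enough_score - combined

    if attempt >= max_attempts:
        return False, "Max attempts reached"
    if combined >= good_enough_score:
        return False, "Quality sufficient"
    if gap < 0.10 and contract_passed:
        return False, "Marginal gap, retry unlikely to help"
    if gap > 0.40 and attempt >= 2:
        return False, "Severe gap persists, source material may be insufficient"
    if is_error:
        return True, "Hard error, retrying"
    if not contract_passed:
        return True, "Contract failed, retrying with healing prompt"
    if verifier_score < low_quality_threshold:
        return True, "Low quality, retrying"
    # ASSUMPTION: nothing matched, so do not retry.
    return False, "No retry condition met"


# First synthesis attempt: contract failed, verifier scored 0.6.
decision = should_retry(1, 3, False, 0.6, good_enough_score=0.70, low_quality_threshold=0.50)
```

Note the ordering: the two "do not retry" gap checks run before the retry triggers, so a severe gap on a second attempt stops the loop even when the contract is still failing.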

Healing prompt

FailureAnalyzer.build_healing_prompt preserves the original query and appends a structured correction block:

<original_query>

[SELF-CORRECTION: Attempt 2 of 3]
Your previous response had quality issues that must be corrected:

  MISSING REQUIREMENT: Insufficient insights: 1 created (minimum 3)
  MISSING REQUIREMENT: Citations required but none found
  QUALITY ISSUE: Claims about market size lack source attribution

Reviewer feedback: Good analysis but synthesis requires direct quotes...

Produce a complete response that fully addresses ALL items above.

If all attempts are exhausted without passing, a harness_degraded event is emitted with attempts_made, best_score, and aggregated issues.
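
The prompt layout shown above can be assembled with plain string formatting. This sketch is not FailureAnalyzer.build_healing_prompt itself: the function signature is hypothetical, and mapping contract issues to MISSING REQUIREMENT and verifier issues to QUALITY ISSUE is inferred from the example rather than stated.

```python
def build_healing_prompt(
    original_query: str,
    attempt: int,
    max_attempts: int,
    contract_issues: list[str],
    verifier_issues: list[str],
    feedback: str = "",
) -> str:
    lines = [
        original_query,  # the original query is preserved verbatim
        "",
        f"[SELF-CORRECTION: Attempt {attempt} of {max_attempts}]",
        "Your previous response had quality issues that must be corrected:",
        "",
    ]
    # Assumed mapping: contract issues vs. verifier issues.
    lines += [f"  MISSING REQUIREMENT: {issue}" for issue in contract_issues]
    lines += [f"  QUALITY ISSUE: {issue}" for issue in verifier_issues]
    if feedback:
        lines += ["", f"Reviewer feedback: {feedback}"]
    lines += ["", "Produce a complete response that fully addresses ALL items above."]
    return "\n".join(lines)


prompt = build_healing_prompt(
    "Summarize the customer interviews",
    attempt=2,
    max_attempts=3,
    contract_issues=["Insufficient insights: 1 created (minimum 3)"],
    verifier_issues=["Claims about market size lack source attribution"],
    feedback="Good analysis but synthesis requires direct quotes",
)
```

Listing each failed requirement separately gives the primary agent specific targets to fix instead of a generic "try again".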

Source: app/core/harness/healing.py.


Iteration Complete

For deep-synthesis iteration runs (multi-pass refinement), the iteration engine emits a single harness_iteration_complete event summarizing the best result across all iterations: number of iterations, best score, accepted count, total duration, and estimated cost. The best iteration's text becomes the final response.


Workspace Profile

WorkspaceProfile is built from harness_telemetry + response_feedback tables and cached for 10 minutes per workspace.

Maturity thresholds

| Maturity | total_runs | quality_multiplier |
|---|---|---|
| new | 0-9 | 0.85 (lower bar: be encouraging) |
| developing | 10-49 | 1.00 (standard) |
| mature | 50+ | 1.15 if avg_verifier_score > 0.75, else 1.00 |

The multiplier scales min_insights and min_output_chars in the contract. Insights are floored at 1, chars at 50.
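
The maturity bands and the scaling rule can be written out as a short sketch. The function names are hypothetical, and two details are assumptions not stated above: scaled values are rounded to the nearest integer, and a threshold of zero stays zero rather than being raised to the floor.

```python
def maturity(total_runs: int) -> str:
    if total_runs < 10:
        return "new"
    if total_runs < 50:
        return "developing"
    return "mature"


def quality_multiplier(total_runs: int, avg_verifier_score: float) -> float:
    level = maturity(total_runs)
    if level == "new":
        return 0.85
    if level == "mature" and avg_verifier_score > 0.75:
        return 1.15
    return 1.00


def scale_contract(min_insights: int, min_output_chars: int, multiplier: float) -> tuple[int, int]:
    """Scale thresholds, flooring insights at 1 and chars at 50."""
    # ASSUMPTION: round to nearest int; zero thresholds are left at zero.
    insights = max(1, round(min_insights * multiplier)) if min_insights > 0 else 0
    chars = max(50, round(min_output_chars * multiplier)) if min_output_chars > 0 else 0
    return insights, chars


# A new workspace (5 runs) with a base contract of 4 insights and 100 chars.
scaled = scale_contract(4, 100, quality_multiplier(5, avg_verifier_score=0.9))
```

The floors keep a lowered bar from collapsing to nothing: even at 0.85, a contract that asked for one insight still asks for one.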

When a workspace crosses a threshold (9→10 or 49→50) on the current run, a milestone_unlocked event is emitted alongside the harness_workspace_profile event.

Source: app/core/harness/workspace_profile.py.


Events Reference

All harness events are emitted as standard SSE (event: <type>\ndata: <json>\n\n). Full payload shapes live in the Streaming (SSE) API reference.

| Event | When | Key fields |
|---|---|---|
| harness_complexity | Pre-flight | level, reason, subtask_count |
| harness_plan_selected | Pre-flight (COMPLEX only) | plan, score, candidates_count |
| harness_stream_alert | Mid-stream at 500 / 2000 / 5000 chars | severity, checkpoint, issue, suggestion |
| harness_swarm_activated | Swarm start | source_count, specialist_count, specialists |
| harness_swarm_complete | Swarm end | specialists_succeeded, aggregator_succeeded, total_duration_ms, estimated_cost_usd |
| harness_subtask_start | Subtask DAG layer begins | subtask_index, total_subtasks, title, workflow_type, parallel |
| harness_subtask_done | Subtask DAG layer ends | subtask_index, total_subtasks, artifacts_created, is_last |
| harness_contract | Post-flight | passed, score, issues, artifact_counts, reasoning |
| harness_verifier | Post-flight | passed, score, issues, feedback, reasoning, contrastive |
| harness_rqs | Post-flight (synthesis) | score, grade, label, breakdown |
| harness_rec_qs | Post-flight (recommendation) | score, grade, label, breakdown |
| harness_doc_qs | Post-flight (drafting) | score, grade, label, breakdown |
| harness_workspace_profile | Post-flight | maturity, total_runs, avg_quality, feedback_rate, common_issues, quality_multiplier |
| harness_retry | Between attempts | attempt, reason, previous_score |
| harness_degraded | All retries exhausted | attempts_made, best_score, workflow_type, issues |
| harness_iteration_complete | Deep-synthesis iteration end | iterations, best_score, accepted_count, total_duration_ms, estimated_cost_usd |
| context_metrics | Just before done | total_tool_calls, cumulative_result_chars, estimated_tool_result_tokens, largest_result, calls_per_tool |
| milestone_unlocked | Workspace crosses maturity threshold | level, threshold, total_runs, description, quality_multiplier |
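
For reference, the SSE framing used by these events (an event line, a data line carrying JSON, then a blank line) can be produced and read back with a few lines of Python. The format_sse and parse_sse helpers are hypothetical names, and the payload values are made up for the example.

```python
import json


def format_sse(event_type: str, payload: dict) -> str:
    """Serialize one event as 'event: <type>' + 'data: <json>' + blank line."""
    return f"event: {event_type}\ndata: {json.dumps(payload)}\n\n"


def parse_sse(frame: str) -> tuple[str, dict]:
    """Parse a single-frame string produced by format_sse."""
    event_line, data_line = frame.strip("\n").split("\n", 1)
    return event_line.removeprefix("event: "), json.loads(data_line.removeprefix("data: "))


frame = format_sse("harness_retry", {"attempt": 2, "reason": "Contract failed", "previous_score": 0.3})
event_type, payload = parse_sse(frame)
```

json.dumps emits a single line by default, which matters here because a raw newline inside the data field would split the frame.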

Tuning

1. Tune retry thresholds via settings

Edit app/core/config.py (or set env vars):

retry_good_enough_score: float = 0.70      # combined threshold above which no retry fires
retry_low_quality_threshold: float = 0.50  # verifier score below which quality retry triggers
retry_max_attempts_synthesis: int = 3      # max attempts for synthesis/full_pipeline
retry_max_attempts_chat: int = 2           # max attempts for chat
retry_max_attempts_default: int = 3        # max attempts for recommendation/drafting/other

2. Swap verifier prompts

Verifier system prompts are loaded via PromptRegistry.get(prompt_id, tier=tier) with hardcoded fallbacks. Override these IDs in the registry to change verifier behavior without a deploy:

  • verifier_system — standard verifier
  • verifier_accuracy, verifier_completeness, verifier_usefulness — contrastive perspectives

3. Adjust contract thresholds per workflow

Edit WorkflowContract._static_for_workflow in app/core/harness/contracts.py. All base contracts are declared in a single contracts dict — changes apply to every future run.

4. Disable swarm per user

Swarm activation is gated by the user's plan entitlement has_synthesis_swarm. Downgrade or upgrade the plan to toggle. There is no per-request disable flag.

The harness never raises to the user. Every component (ComplexityClassifier, VerifierAgent, PlanGenerator, workspace profile loading) has a fallback path that logs and returns a safe default. If Haiku is down, the run still completes — it just ships with score=0.0, feedback="[is_fallback]" so downstream retry logic knows verification was skipped.
