AI Agent QA Guide

AI Agent Replay Debugging

Replay failed AI agent interactions with a practical checklist, immutable audit log template, and trace fields that help QA, engineering, and support teams reproduce production failures.

Do not log secrets, API keys, payment data, or private user data unless a compliance obligation explicitly requires it and your retention and access controls cover that data.

Pin the run, compare steps, and turn failures into regression checks.

Direct Answer

What is AI agent replay debugging?

AI agent replay debugging is the practice of reconstructing a failed agent run from its prompt, tool calls, inputs, outputs, approvals, errors, and environment metadata so a team can reproduce the issue, compare expected behavior, fix the cause, and preserve an audit trail.

Checklist

Failed agent run replay checklist.

Use this before changing prompts, tools, or model configuration. First preserve the evidence, then isolate the failure.

Capture the failing run

Save run ID, user-visible outcome, timestamp, model, prompt version, tool versions, environment, and release hash.

Freeze volatile inputs

Snapshot retrieved documents, API responses, browser state, files, feature flags, and user-provided inputs used by the run.

Mark the failure point

Identify the exact prompt, tool call, approval, parser step, timeout, policy block, or external service response where behavior diverged.

Replay with controls

Re-run with the same settings first. Then mock tools, pin retrieval, and vary one factor at a time to find the cause.

Compare expected behavior

Map the run to requirements, acceptance criteria, and test cases so the team can decide whether the agent or the spec is wrong.

Create a regression check

Turn the minimal failing trace into a Playwright, MCP, API, or manual QA scenario that fails if the same bug reappears, so it is caught before release.
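The "mark the failure point" and "replay with controls" steps can be sketched as a step-by-step comparison of two traces. This is a minimal illustration, not a prescribed implementation: the `Step` shape, the `first_divergence` helper, and the refund example data are all hypothetical and stand in for whatever your tracing system records.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class Step:
    """One recorded step of an agent run (hypothetical shape)."""
    kind: str      # e.g. "prompt", "tool_call", "approval", "output"
    name: str      # e.g. prompt version or tool name
    payload: Any   # inputs or outputs recorded for the step


def first_divergence(original: list[Step], replayed: list[Step]) -> Optional[int]:
    """Return the index of the first differing step, or None if the traces match."""
    for index, (a, b) in enumerate(zip(original, replayed)):
        if a != b:
            return index
    # One trace is a strict prefix of the other: they diverge where the shorter ends.
    if len(original) != len(replayed):
        return min(len(original), len(replayed))
    return None


# Illustrative data only: the replay issues a refund with a different amount.
original = [
    Step("prompt", "v12", "Refund order 1001"),
    Step("tool_call", "lookup_order", {"order_id": "1001"}),
    Step("tool_call", "issue_refund", {"order_id": "1001", "amount": 40}),
]
replayed = [
    Step("prompt", "v12", "Refund order 1001"),
    Step("tool_call", "lookup_order", {"order_id": "1001"}),
    Step("tool_call", "issue_refund", {"order_id": "1001", "amount": 4000}),
]

print(first_divergence(original, replayed))  # 2
```

Comparing whole steps keeps the check strict. In practice a team would probably exclude volatile fields such as latency or timestamps before comparing, so that only meaningful differences are reported.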

Immutable audit log and trace fields template

An immutable audit log should preserve enough information to explain a run without allowing later edits to hide what happened. Store append-only records, redact sensitive values, and keep retention rules explicit.

run_id:
session_id:
user_request_id:
timestamp_utc:
environment: production / staging / local
release_version:
agent_name:
model:
model_parameters:
prompt_version:
system_prompt_hash:
developer_prompt_hash:
user_prompt:
retrieved_context_ids:
tool_manifest_version:
tool_call_id:
tool_name:
tool_input:
tool_output:
tool_error:
approval_required: yes / no
approval_decision:
approval_actor:
final_agent_output:
user_visible_error:
security_policy_event:
retention_policy:
redaction_policy:
linked_test_case:
linked_requirement:
linked_incident:
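One way to make later edits detectable is to chain records by hash, so that each record commits to the one before it. The sketch below is an assumption about how such a log could work, not part of the template: the `AuditLog` class, its in-memory storage, and the sample field values are hypothetical. It provides tamper evidence only; real immutability also needs append-only storage and access controls.

```python
import hashlib
import json


class AuditLog:
    """In-memory append-only log where each record commits to the previous one."""

    GENESIS = "0" * 64  # placeholder "previous hash" for the first record

    def __init__(self) -> None:
        self._records: list[dict] = []

    def append(self, fields: dict) -> dict:
        """Add a record whose hash covers its fields and the previous record's hash."""
        prev_hash = self._records[-1]["record_hash"] if self._records else self.GENESIS
        body = {"fields": fields, "prev_hash": prev_hash}
        record = {**body, "record_hash": self._digest(body)}
        self._records.append(record)
        return record

    def verify(self) -> bool:
        """Recompute every hash and check the chain links in order."""
        prev_hash = self.GENESIS
        for record in self._records:
            body = {"fields": record["fields"], "prev_hash": record["prev_hash"]}
            if record["prev_hash"] != prev_hash or record["record_hash"] != self._digest(body):
                return False
            prev_hash = record["record_hash"]
        return True

    @staticmethod
    def _digest(body: dict) -> str:
        # Sorted keys and fixed separators give a stable serialization to hash.
        canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


log = AuditLog()
log.append({"run_id": "run-001", "tool_name": "lookup_order", "tool_error": None})
log.append({"run_id": "run-001", "approval_required": "yes", "approval_decision": "approved"})
print(log.verify())  # True

# Simulate someone editing a stored record after the fact.
log._records[0]["fields"]["tool_error"] = "edited later"
print(log.verify())  # False
```

The field names follow the template above, but a record can carry any JSON-serializable subset of them.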

Prompt, tool call, input, output, error, approval, and log retention fields

Trace fields to capture for replay and auditability
| Field group | Capture | Replay value |
| --- | --- | --- |
| Prompt | System prompt hash, developer prompt hash, user prompt, prompt version, policy instructions. | Shows whether the model was asked the right thing and whether a prompt change caused the failure. |
| Tool call | Tool name, schema version, arguments, call order, latency, retries, timeout, and returned status. | Separates reasoning bugs from integration bugs, schema mismatch, stale tools, and flaky services. |
| Input | User inputs, retrieved documents, files, page state, API records, feature flags, and environment state. | Lets the team replay the same scenario instead of debugging against changed live data. |
| Output | Intermediate reasoning summary where allowed, tool outputs, final answer, UI action, or generated artifact. | Provides a step-by-step comparison between the original run and the replayed run. |
| Error | Exception, validation failure, parser error, policy block, refusal, timeout, or human escalation reason. | Identifies the failure mode and the smallest reproducible trace for regression coverage. |
| Approval | Approval type, approver, decision, timestamp, blocked action, and post-approval action taken. | Clarifies whether the agent failed before, during, or after a human approval gate. |
| Log retention | Retention period, redaction method, access role, deletion exception, and audit export location. | Keeps debugging useful without turning traces into unmanaged sensitive-data storage. |
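Redaction is easiest to enforce at the point where a trace record is written. The following sketch shows one possible approach, assuming JSON-like tool inputs and a key-based policy. The `SENSITIVE_KEYS` list and the `redact` helper are hypothetical, and a real policy would likely also need pattern-based detection for secrets embedded in free text.

```python
from typing import Any

# Hypothetical list of keys treated as sensitive; adjust to your own policy.
SENSITIVE_KEYS = {"api_key", "authorization", "password", "card_number", "email"}


def redact(value: Any) -> Any:
    """Return a copy of a JSON-like value with sensitive entries masked."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if str(key).lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


# Illustrative tool input containing a header and a user email.
tool_input = {
    "order_id": "1001",
    "headers": {"Authorization": "Bearer abc123"},
    "recipients": [{"email": "user@example.com", "role": "customer"}],
}

print(redact(tool_input))
```

Because `redact` builds a new structure instead of mutating its argument, the agent can keep using the original values while only the masked copy is stored.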

Deterministic replay vs recording vs test case generation

Which evidence method fits the failure?
| Method | Best for | Limits | QA handoff |
| --- | --- | --- | --- |
| Deterministic replay | High-risk agents, production incidents, audits, and regression fixes where the same inputs should reproduce the same steps. | Requires pinned model settings, stable prompts, captured context, mocked external calls, and careful data controls. | Create a minimal replay fixture and connect it to a regression suite. |
| Screenshot or video recording | Explaining what the user saw, UI timing, browser state, and support handoff. | Shows symptoms, but rarely captures hidden prompts, tool inputs, retrieved context, or API failures. | Attach to the incident, then link it to the underlying trace and test case. |
| Test case generation | Turning requirements, acceptance criteria, or incidents into planned manual or automated QA coverage. | Does not prove what happened in the failed run unless it is backed by trace evidence. | Use the generated case to prevent repeat failures after replay identifies the cause. |
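For the deterministic replay row, mocked external calls can be served directly from the captured trace. The sketch below assumes a recorded list of tool calls using the template's `tool_name`, `tool_input`, and `tool_output` fields. `RecordedTools` and `toy_agent` are hypothetical stand-ins: a real agent loop would call a model, which also needs pinned settings to behave repeatably.

```python
from typing import Any, Callable


class RecordedTools:
    """Serves tool results from a captured trace instead of calling live services."""

    def __init__(self, recorded_calls: list[dict]) -> None:
        self._calls = list(recorded_calls)
        self._cursor = 0

    def call(self, tool_name: str, tool_input: dict) -> Any:
        """Return the recorded output, or fail loudly if the replay diverges."""
        if self._cursor >= len(self._calls):
            raise AssertionError(f"unexpected extra call to {tool_name}")
        expected = self._calls[self._cursor]
        if (tool_name, tool_input) != (expected["tool_name"], expected["tool_input"]):
            raise AssertionError(
                f"call {self._cursor} diverged: got {tool_name}({tool_input}), "
                f"recorded {expected['tool_name']}({expected['tool_input']})"
            )
        self._cursor += 1
        return expected["tool_output"]


def toy_agent(call_tool: Callable[[str, dict], Any], order_id: str) -> str:
    """Stand-in for the real agent loop, with fixed logic in place of a model."""
    order = call_tool("lookup_order", {"order_id": order_id})
    if order["status"] != "delivered":
        return "Refund not allowed"
    call_tool("issue_refund", {"order_id": order_id, "amount": order["total"]})
    return "Refund issued"


# Illustrative fixture captured from a failed or reference run.
recorded = [
    {"tool_name": "lookup_order", "tool_input": {"order_id": "1001"},
     "tool_output": {"status": "delivered", "total": 40}},
    {"tool_name": "issue_refund", "tool_input": {"order_id": "1001", "amount": 40},
     "tool_output": {"ok": True}},
]

tools = RecordedTools(recorded)
print(toy_agent(tools.call, "1001"))  # Refund issued
```

Raising on the first mismatched call means a replay that drifts from the recording stops at the divergence point instead of continuing with misleading data.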

Requirements, acceptance criteria, Playwright, and MCP testing links

Replay evidence becomes more valuable when it connects to normal QA artifacts. Start with the Jira test case template for test case fields, then use its AI workflow to turn requirements and acceptance criteria into reviewed cases. For browser agents, convert stable failures into Playwright checks. For tool-using agents, make an MCP-style tool contract test that verifies tool inputs, outputs, approval behavior, and error handling.
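A tool contract test of the kind described above can be approximated without any framework. The sketch below is only an illustration of the idea and does not use the MCP SDK or Playwright: the contract dictionary, the `contract_violations` helper, and the `issue_refund` tool are hypothetical, and the call records reuse field names from the trace template.

```python
# Hypothetical contract for a single tool; names and types are illustrative.
ISSUE_REFUND_CONTRACT = {
    "tool_name": "issue_refund",
    "input_types": {"order_id": str, "amount": int},
    "output_types": {"ok": bool},
    "approval_required": True,
}


def contract_violations(contract: dict, call: dict) -> list[str]:
    """Return human-readable violations for one recorded tool call."""
    problems = _type_problems("tool_input", contract["input_types"], call.get("tool_input") or {})
    # Only check the output shape when the tool did not report an error.
    if call.get("tool_error") is None:
        problems += _type_problems("tool_output", contract["output_types"], call.get("tool_output") or {})
    if contract["approval_required"] and call.get("approval_decision") != "approved":
        problems.append("call ran without an approved decision")
    return problems


def _type_problems(label: str, expected: dict, actual: dict) -> list[str]:
    problems = []
    for field, field_type in expected.items():
        if field not in actual:
            problems.append(f"{label}.{field} is missing")
        elif not isinstance(actual[field], field_type):
            problems.append(f"{label}.{field} should be {field_type.__name__}")
    for field in sorted(actual.keys() - expected.keys()):
        problems.append(f"{label}.{field} is not in the contract")
    return problems


good_call = {
    "tool_input": {"order_id": "1001", "amount": 40},
    "tool_output": {"ok": True},
    "tool_error": None,
    "approval_decision": "approved",
}
bad_call = {
    "tool_input": {"order_id": 1001},
    "tool_output": {"ok": True},
    "tool_error": None,
    "approval_decision": None,
}

print(contract_violations(ISSUE_REFUND_CONTRACT, good_call))  # []
print(contract_violations(ISSUE_REFUND_CONTRACT, bad_call))
```

Running the same check over every tool call in a replay fixture gives a regression test that fails when a tool schema or approval rule changes unexpectedly.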

A useful internal QA path: use the failed run replay checklist to isolate the cause, the trace template to preserve evidence, and the test case template to create regression coverage.

AI agent replay debugging FAQ

How do you replay failed AI agent interactions?

Collect the original prompt, model settings, tool definitions, inputs, outputs, approvals, errors, environment metadata, and trace IDs. Re-run the agent in a controlled environment with external calls mocked or pinned, then compare each step against the original trace.

What should be logged?

Log prompts, tool calls, inputs, outputs, retrieved context, errors, approvals, user-visible messages, model configuration, tool versions, timestamps, correlation IDs, and log retention rules. Avoid storing secrets or unnecessary personal data.

How is this different from test case generation?

Replay debugging reconstructs what already happened in a failed agent run. Test case generation creates planned checks from requirements or acceptance criteria before or after development.

Is deterministic replay required?

Not always. Deterministic replay is ideal for high-risk production agents, audits, and regression investigations. Teams can start with structured traces, mocked tools, pinned inputs, and manual replay checklists before full determinism is available.

Turn a replay into a QA case.

Once the failure is reproduced, document it with the same fields your team uses for requirements, acceptance criteria, and regression tests.
