AI Agent QA Guide
AI Agent Replay Debugging
Replay failed AI agent interactions with a practical checklist, immutable audit log template, and trace fields that help QA, engineering, and support teams reproduce production failures.
Do not log secrets, API keys, payment data, or private user data unless your retention, access, and compliance controls explicitly permit it.
Pin the run, compare steps, and turn failures into regression checks.
Direct Answer
What is AI agent replay debugging?
AI agent replay debugging is the practice of reconstructing a failed agent run from its prompt, tool calls, inputs, outputs, approvals, errors, and environment metadata so a team can reproduce the issue, compare expected behavior, fix the cause, and preserve an audit trail.
Checklist
Failed agent run replay checklist
Use this before changing prompts, tools, or model configuration. First preserve the evidence, then isolate the failure.
Save run ID, user-visible outcome, timestamp, model, prompt version, tool versions, environment, and release hash.
Snapshot retrieved documents, API responses, browser state, files, feature flags, and user-provided inputs used by the run.
Identify the exact prompt, tool call, approval, parser step, timeout, policy block, or external service response where behavior diverged.
Re-run with the same settings first. Then mock tools, pin retrieval, and vary one factor at a time to find the cause.
Map the run to requirements, acceptance criteria, and test cases so the team can decide whether the agent or the spec is wrong.
Turn the minimal failing trace into a Playwright, MCP, API, or manual QA scenario that fails before the same bug can ship again.
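The "identify where behavior diverged" step can be partly automated by comparing the original trace with a replayed one. The following is a minimal sketch, assuming each trace is a list of step dictionaries that use field names from the trace template (`tool_name`, `tool_input`, `tool_output`, `tool_error`). The function name `first_divergence` and the shape of its return value are illustrative assumptions, not part of any particular tracing tool.

```python
from typing import Any, Optional

# Fields compared at each step. These names follow the trace template;
# adjust them to whatever your own trace records contain.
COMPARED_FIELDS = ("tool_name", "tool_input", "tool_output", "tool_error")


def first_divergence(
    original: list[dict[str, Any]],
    replayed: list[dict[str, Any]],
    fields: tuple[str, ...] = COMPARED_FIELDS,
) -> Optional[dict[str, Any]]:
    """Return the first step and field where the replay differs, or None."""
    for index, (orig_step, new_step) in enumerate(zip(original, replayed)):
        for field in fields:
            if orig_step.get(field) != new_step.get(field):
                return {
                    "step": index,
                    "field": field,
                    "original": orig_step.get(field),
                    "replayed": new_step.get(field),
                }
    # All shared steps match, but one run stopped earlier than the other.
    if len(original) != len(replayed):
        return {
            "step": min(len(original), len(replayed)),
            "field": "step_count",
            "original": len(original),
            "replayed": len(replayed),
        }
    return None
```

A `None` result means the replay matched the original on every compared field, which suggests the failure reproduces and the next move is to vary one factor at a time.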
Immutable audit log and trace fields template
An immutable audit log should preserve enough information to explain a run without allowing later edits to hide what happened. Store append-only records, redact sensitive values, and keep retention rules explicit.
run_id:
session_id:
user_request_id:
timestamp_utc:
environment: production / staging / local
release_version:
agent_name:
model:
model_parameters:
prompt_version:
system_prompt_hash:
developer_prompt_hash:
user_prompt:
retrieved_context_ids:
tool_manifest_version:
tool_call_id:
tool_name:
tool_input:
tool_output:
tool_error:
approval_required: yes / no
approval_decision:
approval_actor:
final_agent_output:
user_visible_error:
security_policy_event:
retention_policy:
redaction_policy:
linked_test_case:
linked_requirement:
linked_incident:
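One common way to make later edits detectable is to chain each record to the previous one with a hash. The sketch below assumes records are JSON-serializable dictionaries built from the template fields. The helper names (`append_record`, `verify_chain`) and the `prev_hash` / `entry_hash` keys are hypothetical, chosen only for illustration.

```python
import hashlib
import json
from typing import Any

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(record: dict[str, Any], prev_hash: str) -> str:
    # Canonical JSON (sorted keys) so the same record always hashes the same.
    payload = json.dumps(record, sort_keys=True, separators=(",", ":")) + prev_hash
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_record(log: list[dict[str, Any]], record: dict[str, Any]) -> dict[str, Any]:
    """Append a record linked to the hash of the previous entry."""
    prev_hash = log[-1]["entry_hash"] if log else GENESIS_HASH
    entry = {
        "record": record,
        "prev_hash": prev_hash,
        "entry_hash": _digest(record, prev_hash),
    }
    log.append(entry)
    return entry


def verify_chain(log: list[dict[str, Any]]) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS_HASH
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["entry_hash"] != _digest(entry["record"], prev_hash):
            return False
        prev_hash = entry["entry_hash"]
    return True
```

A hash chain only makes tampering detectable. Actual immutability still depends on where the log is stored, for example write-once storage or tightly restricted write access.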
Prompt, tool call, input, output, error, approval, and log retention fields
| Field group | Capture | Replay value |
|---|---|---|
| Prompt | System prompt hash, developer prompt hash, user prompt, prompt version, policy instructions. | Shows whether the model was asked the right thing and whether a prompt change caused the failure. |
| Tool call | Tool name, schema version, arguments, call order, latency, retries, timeout, and returned status. | Separates reasoning bugs from integration bugs, schema mismatch, stale tools, and flaky services. |
| Input | User inputs, retrieved documents, files, page state, API records, feature flags, and environment state. | Lets the team replay the same scenario instead of debugging against changed live data. |
| Output | Intermediate reasoning summary where allowed, tool outputs, final answer, UI action, or generated artifact. | Provides a step-by-step comparison between the original run and the replayed run. |
| Error | Exception, validation failure, parser error, policy block, refusal, timeout, or human escalation reason. | Identifies the failure mode and the smallest reproducible trace for regression coverage. |
| Approval | Approval type, approver, decision, timestamp, blocked action, and post-approval action taken. | Clarifies whether the agent failed before, during, or after a human approval gate. |
| Log retention | Retention period, redaction method, access role, deletion exception, and audit export location. | Keeps debugging useful without turning traces into unmanaged sensitive-data storage. |
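The redaction method in the log retention row can be as simple as masking known sensitive keys before a record is written. This is a minimal sketch, assuming tool inputs and outputs are nested dictionaries and lists. The key names in `SENSITIVE_KEYS` and the `[REDACTED]` marker are assumptions; a real deployment would derive them from its own redaction policy.

```python
from typing import Any

# Hypothetical list of key names to mask; replace with your policy's list.
SENSITIVE_KEYS = frozenset({"api_key", "authorization", "password", "card_number", "token"})
REDACTED = "[REDACTED]"


def redact(value: Any, sensitive_keys: frozenset[str] = SENSITIVE_KEYS) -> Any:
    """Return a copy of value with sensitive dictionary entries masked."""
    if isinstance(value, dict):
        return {
            key: REDACTED if str(key).lower() in sensitive_keys else redact(item, sensitive_keys)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item, sensitive_keys) for item in value]
    # Scalars are returned unchanged; this sketch does not scan free text.
    return value
```

Key-based masking does not catch secrets embedded in free text such as a user prompt, so it complements access controls and retention limits instead of replacing them.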
Deterministic replay vs recording vs test case generation
| Method | Best for | Limits | QA handoff |
|---|---|---|---|
| Deterministic replay | High-risk agents, production incidents, audits, and regression fixes where the same inputs should reproduce the same steps. | Requires pinned model settings, stable prompts, captured context, mocked external calls, and careful data controls. | Create a minimal replay fixture and connect it to a regression suite. |
| Screenshot or video recording | Explaining what the user saw, UI timing, browser state, and support handoff. | Shows symptoms, but rarely captures hidden prompts, tool inputs, retrieved context, or API failures. | Attach to the incident, then link it to the underlying trace and test case. |
| Test case generation | Turning requirements, acceptance criteria, or incidents into planned manual or automated QA coverage. | Does not prove what happened in the failed run unless it is backed by trace evidence. | Use the generated case to prevent repeat failures after replay identifies the cause. |
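For deterministic replay, "mocked external calls" usually means serving the recorded tool outputs back to the agent instead of calling live services. The sketch below assumes the recorded trace is a list of dictionaries with `tool_name`, `tool_input`, and `tool_output` fields. The `ReplayTools` class and `UnrecordedToolCall` exception are hypothetical names for illustration, not an existing library API.

```python
import json
from collections import defaultdict, deque
from typing import Any


class UnrecordedToolCall(Exception):
    """Raised when the replayed agent makes a call the original run did not."""


class ReplayTools:
    """Serve recorded tool outputs in their original order, with no live calls."""

    def __init__(self, recorded_calls: list[dict[str, Any]]) -> None:
        self._outputs: dict[str, deque] = defaultdict(deque)
        for call in recorded_calls:
            key = self._key(call["tool_name"], call["tool_input"])
            self._outputs[key].append(call["tool_output"])

    @staticmethod
    def _key(tool_name: str, tool_input: Any) -> str:
        # Canonical JSON so argument order does not affect matching.
        return tool_name + ":" + json.dumps(tool_input, sort_keys=True)

    def call(self, tool_name: str, tool_input: Any) -> Any:
        queue = self._outputs.get(self._key(tool_name, tool_input))
        if not queue:
            # A new or extra call is itself a divergence worth investigating.
            raise UnrecordedToolCall(f"{tool_name} called with unrecorded input")
        return queue.popleft()
```

Raising on unrecorded calls is a deliberate choice in this sketch: a replay that quietly falls back to a live service is no longer pinned to the original evidence.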
Requirements, acceptance criteria, Playwright, and MCP testing links
Replay evidence becomes more valuable when it connects to normal QA artifacts. Start with the Jira test case template for test case fields, then use its AI workflow to turn requirements and acceptance criteria into reviewed cases. For browser agents, convert stable failures into Playwright checks. For tool-using agents, make an MCP-style tool contract test that verifies tool inputs, outputs, approval behavior, and error handling.
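A tool contract test of the kind described above can be sketched without any MCP SDK by checking a recorded tool call against a small hand-written contract. Everything here is an assumption for illustration: the `issue_refund` tool, the contract layout, the `contract_violations` helper, and the `"approved"` value for `approval_decision`.

```python
from typing import Any

# Hypothetical contract for an imaginary refund tool.
REFUND_CONTRACT: dict[str, Any] = {
    "tool_name": "issue_refund",
    "required_input": {"order_id": str, "amount_cents": int},
    "required_output": {"status": str},
    "approval_required": True,
}


def contract_violations(contract: dict[str, Any], call: dict[str, Any]) -> list[str]:
    """Return human-readable contract violations for one recorded tool call."""
    problems: list[str] = []
    if call.get("tool_name") != contract["tool_name"]:
        problems.append(f"unexpected tool: {call.get('tool_name')}")

    tool_input = call.get("tool_input") or {}
    for field, expected_type in contract["required_input"].items():
        if not isinstance(tool_input.get(field), expected_type):
            problems.append(f"input.{field} should be {expected_type.__name__}")

    # Only check the output shape when the tool did not report an error.
    if call.get("tool_error") is None:
        tool_output = call.get("tool_output") or {}
        for field, expected_type in contract["required_output"].items():
            if not isinstance(tool_output.get(field), expected_type):
                problems.append(f"output.{field} should be {expected_type.__name__}")

    if contract["approval_required"] and call.get("approval_decision") != "approved":
        problems.append("approval missing or not approved")
    return problems
```

In a regression suite, each minimal failing trace can be run through a check like this, with the test asserting that the returned list is empty.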
A useful internal QA path: use the failed run replay checklist to isolate the cause, the trace template to preserve evidence, and the test case template to create regression coverage.
AI agent replay debugging FAQ
How do you replay failed AI agent interactions?
Collect the original prompt, model settings, tool definitions, inputs, outputs, approvals, errors, environment metadata, and trace IDs. Re-run the agent in a controlled environment with external calls mocked or pinned, then compare each step against the original trace.
What should be logged?
Log prompts, tool calls, inputs, outputs, retrieved context, errors, approvals, user-visible messages, model configuration, tool versions, timestamps, correlation IDs, and log retention rules. Avoid storing secrets or unnecessary personal data.
How is this different from test case generation?
Replay debugging reconstructs what already happened in a failed agent run. Test case generation creates planned checks from requirements or acceptance criteria before or after development.
Is deterministic replay required?
Deterministic replay is ideal for high-risk production agents, audits, and regression investigations, but teams can start with structured traces, mocked tools, pinned inputs, and manual replay checklists before full determinism is available.
Once the failure is reproduced, document it with the same fields your team uses for requirements, acceptance criteria, and regression tests.