[Ch 8] Building an Evaluation System for Your Agent
How do you know if your agent is actually working well? User feedback is lagging and subjective. Manual inspection doesn’t scale. Unit tests can’t capture emergent LLM behavior. This chapter builds a principled evaluation pipeline that gives you a repeatable, quantitative answer.
Why Agent Evaluation Is Hard
Traditional software testing is deterministic: given input X, assert output Y. Agent evaluation is different:
| Challenge | Why It’s Hard |
|---|---|
| Non-determinism | Same input can produce different tool call sequences |
| Long dependency chains | A failure in step 3 is caused by a decision in step 1 |
| Multi-dimensional quality | Correctness, completeness, tone, and tool efficiency all matter |
| Latent regressions | Changing the system prompt can quietly degrade performance on edge cases |
| No ground truth | For open-ended tasks, there’s no single “correct” answer |
The answer is a layered evaluation approach: combine deterministic checks where you can with LLM judges where you must.
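As a toy illustration of the layering idea, a cheap deterministic gate can reject a response before any expensive LLM judging happens. The names `deterministic_check` and the stubbed `llm_judge` below are illustrative only, not part of the chapter's code:

```python
# Toy sketch of layered evaluation: run cheap deterministic checks first,
# and only escalate to an (expensive) LLM judge when they pass.

def deterministic_check(response: str, required: list[str], forbidden: list[str]) -> bool:
    """Fast, repeatable keyword checks. No LLM call needed."""
    text = response.lower()
    if any(word.lower() not in text for word in required):
        return False
    return not any(word.lower() in text for word in forbidden)

def evaluate_layered(response: str, required: list[str], forbidden: list[str], llm_judge) -> float:
    # Layer 1: deterministic gate (milliseconds, free)
    if not deterministic_check(response, required, forbidden):
        return 0.0
    # Layer 2: nuanced scoring (seconds, costs tokens)
    return llm_judge(response)

# Usage with a stubbed judge standing in for a real rubric-scoring call:
score = evaluate_layered(
    "Task created with high priority.",
    required=["task", "created"],
    forbidden=["error"],
    llm_judge=lambda r: 0.9,
)
print(score)  # → 0.9
```

The rest of the chapter builds the production version of exactly this split: Stage 2 is the deterministic layer, Stage 3 is the judge.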
The Three-Stage Pipeline
The pipeline has three stages: Stage 1 runs the agent on golden inputs and records its outputs, Stage 2 applies fast rule-based checks to those outputs, and Stage 3 scores nuanced quality with an LLM judge.
Installation

```bash
pip install deepeval pytest pytest-asyncio
```

```bash
# .env.example
OPENAI_API_KEY=your-api-key-here
DEEPEVAL_API_KEY=your-deepeval-key  # optional: for Confident AI dashboard
```
Stage 1: Generate Test Data
The first stage runs your agent against a set of golden test cases — inputs where you know what a good response looks like — and saves the actual outputs for evaluation.
Define Test Cases
```python
# evaluations/test_cases.py
from dataclasses import dataclass, field


@dataclass
class AgentTestCase:
    """A single evaluation scenario."""

    id: str
    user_input: str
    expected_tool_calls: list[str]   # tools that should have been called
    expected_topics: list[str]       # topics that should appear in the response
    should_not_contain: list[str] = field(default_factory=list)  # forbidden content


TEST_CASES = [
    AgentTestCase(
        id="tc-001-create-high-priority",
        user_input="Create a high priority task to deploy the auth service to production",
        expected_tool_calls=["create_task"],
        expected_topics=["task", "created", "high", "deploy", "auth"],
    ),
    AgentTestCase(
        id="tc-002-list-filtered",
        user_input="Show me all tasks that are currently in progress",
        expected_tool_calls=["list_tasks"],
        expected_topics=["in_progress", "task"],
        should_not_contain=["todo", "done"],
    ),
    AgentTestCase(
        id="tc-003-policy-lookup",
        user_input="What is the expense report deadline?",
        expected_tool_calls=["search_knowledge_base"],
        expected_topics=["30 days", "receipt", "expense"],
    ),
    AgentTestCase(
        id="tc-004-multi-step",
        user_input="Create a task to review the Q1 report, then tell me all my current tasks",
        expected_tool_calls=["create_task", "list_tasks"],
        expected_topics=["created", "Q1", "review"],
    ),
    AgentTestCase(
        id="tc-005-error-handling",
        user_input="Update the status of task-999 to done",
        expected_tool_calls=["update_task_status"],
        expected_topics=["not found", "error", "task-999"],
    ),
]
```
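One practical safeguard worth adding (a hypothetical helper, not part of the chapter's package): Stage 1 writes one JSON file per case id, so two cases sharing an id would silently overwrite each other's output. A quick uniqueness check catches that:

```python
# Convenience sketch: guard against duplicate test-case ids, which would
# silently overwrite output JSON files in Stage 1 (each case is saved
# as <id>.json).
from collections import Counter
from types import SimpleNamespace


def assert_unique_ids(cases) -> None:
    """Raise ValueError if any two test cases share an id."""
    counts = Counter(tc.id for tc in cases)
    dupes = [case_id for case_id, n in counts.items() if n > 1]
    if dupes:
        raise ValueError(f"Duplicate test case ids: {dupes}")


# Works with anything exposing an `.id` attribute, including AgentTestCase:
assert_unique_ids([SimpleNamespace(id="tc-001"), SimpleNamespace(id="tc-002")])  # OK
```

Calling this once at the top of the Stage 1 script turns a silent data loss into a loud failure.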
Run Agent and Collect Outputs
```python
# evaluations/01_generate_test_data.py
"""
Stage 1: Run the agent against all test cases and save actual outputs.

Run this script whenever you want to refresh the evaluation dataset.
"""
import asyncio
import json
from pathlib import Path

from dotenv import load_dotenv
from langchain_core.messages import AIMessage, HumanMessage, ToolMessage

from agent.graph import build_graph
from evaluations.test_cases import TEST_CASES

load_dotenv()

OUTPUT_DIR = Path("evaluations/outputs")


async def collect_agent_output(app, test_case) -> dict:
    """Run one test case and collect structured output."""
    config = {"configurable": {"thread_id": f"eval-{test_case.id}"}}
    final_state = await app.ainvoke(
        {"messages": [HumanMessage(content=test_case.user_input)]},
        config=config,
    )
    messages = final_state["messages"]

    # Extract the tool calls that were actually made
    actual_tool_calls = []
    for msg in messages:
        if isinstance(msg, AIMessage) and msg.tool_calls:
            actual_tool_calls.extend(c["name"] for c in msg.tool_calls)

    # Extract tool results
    tool_results = [msg.content for msg in messages if isinstance(msg, ToolMessage)]

    # Final response
    final_response = messages[-1].content if messages else ""

    return {
        "test_case_id": test_case.id,
        "user_input": test_case.user_input,
        "final_response": final_response,
        "actual_tool_calls": actual_tool_calls,
        "tool_results": tool_results,
        "message_count": len(messages),
    }


async def main():
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    app = build_graph(db_path=":memory:")  # use in-memory DB for eval runs

    results = []
    for tc in TEST_CASES:
        print(f"Running [{tc.id}]: {tc.user_input[:60]}...")
        output = await collect_agent_output(app, tc)
        results.append(output)

        # Save individual output for inspection
        out_file = OUTPUT_DIR / f"{tc.id}.json"
        out_file.write_text(json.dumps(output, indent=2))

    # Save combined dataset
    dataset_file = OUTPUT_DIR / "dataset.json"
    dataset_file.write_text(json.dumps(results, indent=2))
    print(f"\n✅ Saved {len(results)} outputs to {OUTPUT_DIR}/")


if __name__ == "__main__":
    asyncio.run(main())
```
Stage 2: Rule-Based Evaluation
Fast, deterministic checks that run in milliseconds. These catch obvious failures: wrong tools called, required topics missing, forbidden content present.
```python
# evaluations/02_rule_based_eval.py
"""
Stage 2: Rule-based assertions using pytest.

Run: pytest evaluations/02_rule_based_eval.py -v
"""
import json
from pathlib import Path

import pytest

from evaluations.test_cases import TEST_CASES, AgentTestCase

OUTPUT_DIR = Path("evaluations/outputs")


def load_output(test_case_id: str) -> dict:
    output_file = OUTPUT_DIR / f"{test_case_id}.json"
    if not output_file.exists():
        pytest.skip(f"Output for {test_case_id} not found — run Stage 1 first.")
    return json.loads(output_file.read_text())


# Parametrize over all test cases
@pytest.mark.parametrize("tc", TEST_CASES, ids=lambda tc: tc.id)
class TestRuleBasedEval:
    def test_expected_tools_were_called(self, tc: AgentTestCase):
        """Every expected tool must appear in actual_tool_calls."""
        output = load_output(tc.id)
        actual = output["actual_tool_calls"]
        for expected_tool in tc.expected_tool_calls:
            assert expected_tool in actual, (
                f"Expected tool '{expected_tool}' was not called.\n"
                f"Actual calls: {actual}"
            )

    def test_response_not_empty(self, tc: AgentTestCase):
        """Final response must be non-empty."""
        output = load_output(tc.id)
        assert output["final_response"].strip(), "Final response is empty"

    def test_response_minimum_length(self, tc: AgentTestCase):
        """Response should be at least 20 characters (not just a status code)."""
        output = load_output(tc.id)
        assert len(output["final_response"]) >= 20, (
            f"Response too short: '{output['final_response']}'"
        )

    def test_expected_topics_present(self, tc: AgentTestCase):
        """All expected topics should appear in the final response (case-insensitive)."""
        output = load_output(tc.id)
        response_lower = output["final_response"].lower()
        missing = [t for t in tc.expected_topics if t.lower() not in response_lower]
        assert not missing, (
            f"Missing expected topics: {missing}\n"
            f"Response: {output['final_response'][:200]}"
        )

    def test_forbidden_content_absent(self, tc: AgentTestCase):
        """No forbidden phrases should appear in the response."""
        if not tc.should_not_contain:
            pytest.skip("No forbidden content defined for this test case")
        output = load_output(tc.id)
        response_lower = output["final_response"].lower()
        found = [f for f in tc.should_not_contain if f.lower() in response_lower]
        assert not found, (
            f"Forbidden content found: {found}\n"
            f"Response: {output['final_response'][:200]}"
        )

    def test_no_extra_tool_calls(self, tc: AgentTestCase):
        """Agent should not call more tools than expected (efficiency check)."""
        output = load_output(tc.id)
        actual_count = len(output["actual_tool_calls"])
        expected_count = len(tc.expected_tool_calls)
        # Allow up to 1 extra call (e.g., for retries), but flag excessive calls
        assert actual_count <= expected_count + 1, (
            f"Too many tool calls: expected {expected_count}, got {actual_count}\n"
            f"Calls made: {output['actual_tool_calls']}"
        )
```
Stage 3: LLM-as-Judge with DeepEval GEval
Rule-based checks miss nuanced quality issues: a response can contain the right keywords but still be poorly reasoned, incomplete, or unhelpful. For this, we use an LLM to judge the response against a detailed rubric.
What is GEval?
GEval (Generation Evaluation) is DeepEval’s framework for defining custom LLM evaluation criteria with step-by-step reasoning. Each metric in our pipeline has:
- A name — what you’re measuring
- Evaluation steps — the rubric the judge LLM follows
- A weight — how much the metric contributes to the overall score (the weight is applied in our own scoring code below; it is not a GEval parameter)
Define Custom Metrics
```python
# evaluations/metrics.py
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# Metric 1: Task Action Accuracy
# Does the response correctly reflect what the tools actually did?
task_accuracy = GEval(
    name="Task Action Accuracy",
    evaluation_steps=[
        "Read the user's request and the agent's final response carefully.",
        "Check if the response accurately reflects the actions taken (e.g., task created, list shown).",
        "Verify that task IDs, titles, priorities, and statuses mentioned in the response match the tool results.",
        "Penalize heavily if the response claims an action was performed that was not, or vice versa.",
        "Score 1.0 if fully accurate, 0.5 if partially accurate, 0.0 if inaccurate.",
    ],
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.RETRIEVAL_CONTEXT,  # tool results as context
    ],
    threshold=0.7,
)

# Metric 2: Response Completeness
# Does the response answer all parts of the user's request?
completeness = GEval(
    name="Response Completeness",
    evaluation_steps=[
        "List all distinct requests made by the user in their message.",
        "For each request, determine whether the response addresses it.",
        "A response is complete if it addresses all requests and provides necessary details.",
        "Partial credit if most but not all requests are addressed.",
        "Score 1.0 if fully complete, 0.0 if the main request is unanswered.",
    ],
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
    threshold=0.7,
)

# Metric 3: Conciseness
# Is the response appropriately brief without losing key information?
conciseness = GEval(
    name="Conciseness",
    evaluation_steps=[
        "Assess whether the response contains unnecessary repetition or filler text.",
        "Good responses confirm actions clearly without restating the user's input verbatim.",
        "Penalize responses that are verbose or padded; reward tight, informative ones.",
        "A one-sentence confirmation for a simple action is ideal.",
        "Score 1.0 if concise, 0.5 if somewhat verbose, 0.0 if excessively padded.",
    ],
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
    threshold=0.6,
)

# Weighted combination
WEIGHTED_METRICS = [
    (task_accuracy, 0.50),  # most important: did it do the right thing?
    (completeness, 0.30),   # did it address everything?
    (conciseness, 0.20),    # was it well-phrased?
]
```
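To make the weighting concrete, this is the arithmetic the Stage 3 script performs for one test case. The metric scores here are made up for illustration; in the real pipeline they come from the GEval judges:

```python
# Illustrative weighted-score arithmetic (scores are invented for the example).
weighted_metrics = [
    ("task_accuracy", 0.90, 0.50),
    ("completeness", 0.80, 0.30),
    ("conciseness", 1.00, 0.20),
]

total = sum(score * weight for _, score, weight in weighted_metrics)
total_weight = sum(weight for _, _, weight in weighted_metrics)
overall = total / total_weight  # weights sum to 1.0, so this equals `total`

print(round(overall, 3))  # → 0.89  (0.45 + 0.24 + 0.20)
```

Because accuracy carries half the weight, a response that hallucinated an action (accuracy 0.0) caps out at 0.50 even with perfect completeness and conciseness, which is below the 0.7 passing threshold by design.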
Run the LLM Evaluation
```python
# evaluations/03_llm_eval.py
"""
Stage 3: LLM-as-judge evaluation using DeepEval GEval.

Run: python evaluations/03_llm_eval.py
"""
import json
from pathlib import Path

from deepeval import evaluate
from deepeval.test_case import LLMTestCase

from evaluations.metrics import WEIGHTED_METRICS, task_accuracy, completeness, conciseness
from evaluations.test_cases import TEST_CASES

OUTPUT_DIR = Path("evaluations/outputs")


def load_output(test_case_id: str) -> dict | None:
    output_file = OUTPUT_DIR / f"{test_case_id}.json"
    if not output_file.exists():
        return None
    return json.loads(output_file.read_text())


def build_llm_test_case(tc, output: dict) -> LLMTestCase:
    """Map our AgentTestCase + agent output to DeepEval's LLMTestCase."""
    return LLMTestCase(
        input=tc.user_input,
        actual_output=output["final_response"],
        retrieval_context=output["tool_results"],  # tool results inform accuracy check
        expected_output=f"Should mention: {', '.join(tc.expected_topics)}",
    )


def compute_weighted_score(test_case: LLMTestCase) -> float:
    """Compute a weighted overall score across all metrics."""
    total_weight = 0.0
    total_score = 0.0
    for metric, weight in WEIGHTED_METRICS:
        metric.measure(test_case)
        total_score += metric.score * weight
        total_weight += weight
    return total_score / total_weight if total_weight > 0 else 0.0


def main():
    llm_test_cases = []
    for tc in TEST_CASES:
        output = load_output(tc.id)
        if output is None:
            print(f"⚠️ Skipping {tc.id} — run Stage 1 first.")
            continue
        llm_test_cases.append((tc, build_llm_test_case(tc, output)))

    if not llm_test_cases:
        print("No test cases to evaluate. Run Stage 1 first.")
        return

    print(f"\nEvaluating {len(llm_test_cases)} test cases with LLM judge...\n")

    results = []
    for tc, ltc in llm_test_cases:
        weighted_score = compute_weighted_score(ltc)
        passed = weighted_score >= 0.7  # overall passing threshold
        results.append({
            "id": tc.id,
            "input": tc.user_input[:60],
            "weighted_score": round(weighted_score, 3),
            "passed": passed,
        })
        status = "✅ PASS" if passed else "❌ FAIL"
        print(f"[{status}] {tc.id}: {weighted_score:.3f}")

    # ── Run DeepEval's built-in evaluate() for the full report ───────────────
    all_ltc = [ltc for _, ltc in llm_test_cases]
    all_metrics = [task_accuracy, completeness, conciseness]
    evaluate(test_cases=all_ltc, metrics=all_metrics)

    # ── Print summary ────────────────────────────────────────────────────────
    passed_count = sum(1 for r in results if r["passed"])
    avg_score = sum(r["weighted_score"] for r in results) / len(results)
    print(f"\n{'=' * 50}")
    print(f"Results: {passed_count}/{len(results)} passed")
    print(f"Average weighted score: {avg_score:.3f}")
    print(f"{'=' * 50}")

    # Save results for trend tracking
    results_file = OUTPUT_DIR / "llm_eval_results.json"
    results_file.write_text(json.dumps(results, indent=2))
    print(f"\nDetailed results saved to {results_file}")


if __name__ == "__main__":
    main()
```
Running the Full Pipeline
```bash
# Step 1: Generate test data (run the agent on all golden inputs)
python evaluations/01_generate_test_data.py

# Step 2: Rule-based checks (fast, < 1 second)
pytest evaluations/02_rule_based_eval.py -v

# Step 3: LLM-as-judge (slower, ~30-60 seconds depending on case count)
python evaluations/03_llm_eval.py
```
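If you want a single entry point, a small driver can chain the three stages and fail fast. `run_pipeline`, its `STAGES` list, and the file name `evaluations/run_all.py` are hypothetical conveniences, not part of the chapter's package:

```python
# Convenience driver (hypothetical file: evaluations/run_all.py).
# Runs the three stages in order and stops at the first one that fails.
import subprocess
import sys

STAGES = [
    ["python", "evaluations/01_generate_test_data.py"],
    ["pytest", "evaluations/02_rule_based_eval.py", "-v"],
    ["python", "evaluations/03_llm_eval.py"],
]


def run_pipeline(stages=STAGES) -> int:
    """Return 0 if every stage succeeds, else the first failing exit code."""
    for cmd in stages:
        print(f"Running: {' '.join(cmd)}")
        result = subprocess.run(cmd)
        if result.returncode != 0:
            print(f"Stage failed with exit code {result.returncode}; stopping.")
            return result.returncode
    return 0


# Smoke-test the driver with a trivial stage instead of the real scripts:
print(run_pipeline([[sys.executable, "-c", "print('stage ok')"]]))  # → 0
```

Wiring `sys.exit(run_pipeline())` under an `if __name__ == "__main__"` guard makes the script CI-friendly: the job fails as soon as any stage does.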
Sample output from Stage 2:

```
PASSED evaluations/02_rule_based_eval.py::TestRuleBasedEval::test_expected_tools_were_called[tc-001-create-high-priority]
PASSED evaluations/02_rule_based_eval.py::TestRuleBasedEval::test_expected_topics_present[tc-001-create-high-priority]
FAILED evaluations/02_rule_based_eval.py::TestRuleBasedEval::test_expected_topics_present[tc-005-error-handling]
    AssertionError: Missing expected topics: ['not found']
```
Sample output from Stage 3:

```
[✅ PASS] tc-001-create-high-priority: 0.891
[✅ PASS] tc-002-list-filtered: 0.843
[✅ PASS] tc-003-policy-lookup: 0.776
[✅ PASS] tc-004-multi-step: 0.801
[❌ FAIL] tc-005-error-handling: 0.623
==================================================
Results: 4/5 passed
Average weighted score: 0.787
==================================================
```
When to Run Evaluations
| Trigger | Stage to Run | Why |
|---|---|---|
| After any system prompt change | Stage 2 + 3 | Prompt changes silently affect behavior |
| After adding/modifying a tool | Stage 2 | Check tool call patterns |
| After upgrading the LLM version | All 3 | Model updates change behavior |
| Before every production release | All 3 | Full regression check |
| Weekly (scheduled) | All 3 | Catch model drift from provider updates |
💡 Ollama note: DeepEval’s GEval uses OpenAI as the judge LLM by default. To use a local model as the judge, implement a custom judge class:

```python
from deepeval.models import DeepEvalBaseLLM
# Implement a custom judge class wrapping OllamaLLM
# See: https://docs.confident-ai.com/docs/metrics-custom-judge
```

Note: smaller models (< 13B parameters) often produce inconsistent rubric scores. For reliable LLM judging, use at least GPT-4o-mini or an equivalently capable model.
Summary
| Stage | Method | Speed | What It Catches |
|---|---|---|---|
| 1 | Agent execution on golden inputs | ~5–30s | Nothing — generates data |
| 2 | pytest rule-based assertions | < 1s | Wrong tools, missing topics, forbidden content |
| 3 | DeepEval GEval LLM-as-judge | ~30–60s | Nuanced quality, accuracy, completeness |
Series Complete
You’ve now built a production-grade AI agent from scratch:
| Chapter | What You Built |
|---|---|
| Ch 1 | Mental model: agent loop, ReAct, when to use agents |
| Ch 2 | Context window anatomy, token budget management |
| Ch 3 | LangChain + LangGraph fundamentals |
| Ch 4 | Full multi-turn agent with SQLite persistence and streaming |
| Ch 5 | AgentContext injection, NeMo Guardrails, HITL interrupts |
| Ch 6 | Four-layer memory: trim, MongoDB, FAISS RAG, system prompt |
| Ch 7 | Full Langfuse observability: traces, spans, cost tracking |
| Ch 8 | Three-stage evaluation pipeline with DeepEval GEval |
The full source code for this series is organized as a single Python package. Combine these pieces and you have a system that is observable, safe, memory-aware, and measurably correct.
