[Ch 8] Building an Evaluation System for Your Agent

Apr 7, 2026 · 10 min read

How do you know if your agent is actually working well? User feedback is lagging and subjective. Manual inspection doesn’t scale. Unit tests can’t capture emergent LLM behavior. This chapter builds a principled evaluation pipeline that gives you a repeatable, quantitative answer.


Why Agent Evaluation Is Hard

Traditional software testing is deterministic: given input X, assert output Y. Agent evaluation is different:

| Challenge | Why It's Hard |
|---|---|
| Non-determinism | Same input can produce different tool call sequences |
| Long dependency chains | A failure in step 3 is caused by a decision in step 1 |
| Multi-dimensional quality | Correctness, completeness, tone, and tool efficiency all matter |
| Latent regressions | Changing the system prompt can quietly degrade performance on edge cases |
| No ground truth | For open-ended tasks, there's no single "correct" answer |

The answer is a layered evaluation approach — combine deterministic checks where you can, and LLM judges where you must.
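The layering idea in miniature: run the cheap deterministic checks first, and only spend an LLM call when they pass. A minimal sketch (`layered_eval` and `judge_fn` are illustrative names, not library APIs):

```python
def layered_eval(response: str, required_terms: list[str], judge_fn) -> dict:
    """Cheap deterministic checks first; escalate to the LLM judge only if they pass."""
    missing = [t for t in required_terms if t.lower() not in response.lower()]
    if missing:
        # Deterministic failure -- no need to spend an LLM call
        return {"passed": False, "stage": "rules", "missing": missing}
    score = judge_fn(response)  # expensive LLM-as-judge call
    return {"passed": score >= 0.7, "stage": "judge", "score": score}
```

The three-stage pipeline below is this pattern scaled up: Stage 2 is the rules layer, Stage 3 is the judge layer.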


The Three-Stage Pipeline

graph LR
    A["📝 Stage 1\nGenerate Test Data\n(run agent on golden inputs)"] --> B["✅ Stage 2\nRule-based Eval\n(pytest assertions)"]
    B --> C["🤖 Stage 3\nLLM-as-Judge\n(DeepEval GEval)"]
    C --> D["📊 Score Report\n& Regression Detection"]
    style A fill:#2563EB,color:#fff,stroke:none
    style B fill:#059669,color:#fff,stroke:none
    style C fill:#D97706,color:#fff,stroke:none
    style D fill:#7C3AED,color:#fff,stroke:none
Fig 1: Three-stage evaluation pipeline — from test data generation to scored report

Installation

pip install deepeval pytest pytest-asyncio
# .env.example
OPENAI_API_KEY=your-api-key-here
DEEPEVAL_API_KEY=your-deepeval-key  # optional: for Confident AI dashboard

Stage 1: Generate Test Data

The first stage runs your agent against a set of golden test cases — inputs where you know what a good response looks like — and saves the actual outputs for evaluation.

Define Test Cases

# evaluations/test_cases.py
from dataclasses import dataclass, field

@dataclass
class AgentTestCase:
    """A single evaluation scenario."""
    id: str
    user_input: str
    expected_tool_calls: list[str]    # tools that should have been called
    expected_topics: list[str]        # topics that should appear in the response
    should_not_contain: list[str] = field(default_factory=list)  # forbidden content

TEST_CASES = [
    AgentTestCase(
        id="tc-001-create-high-priority",
        user_input="Create a high priority task to deploy the auth service to production",
        expected_tool_calls=["create_task"],
        expected_topics=["task", "created", "high", "deploy", "auth"],
    ),
    AgentTestCase(
        id="tc-002-list-filtered",
        user_input="Show me all tasks that are currently in progress",
        expected_tool_calls=["list_tasks"],
        expected_topics=["in_progress", "task"],
        should_not_contain=["todo", "done"],
    ),
    AgentTestCase(
        id="tc-003-policy-lookup",
        user_input="What is the expense report deadline?",
        expected_tool_calls=["search_knowledge_base"],
        expected_topics=["30 days", "receipt", "expense"],
    ),
    AgentTestCase(
        id="tc-004-multi-step",
        user_input="Create a task to review the Q1 report, then tell me all my current tasks",
        expected_tool_calls=["create_task", "list_tasks"],
        expected_topics=["created", "Q1", "review"],
    ),
    AgentTestCase(
        id="tc-005-error-handling",
        user_input="Update the status of task-999 to done",
        expected_tool_calls=["update_task_status"],
        expected_topics=["not found", "error", "task-999"],
    ),
]
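Before spending API calls on a run, it can be worth sanity-checking the golden set itself. A small sketch (`validate_test_cases` is our own helper, not part of any library; it only assumes objects with the `AgentTestCase` fields):

```python
def validate_test_cases(cases) -> list[str]:
    """Flag structural problems in a golden test-case set before running it."""
    problems = []
    seen_ids = set()
    for tc in cases:
        if tc.id in seen_ids:
            problems.append(f"duplicate id: {tc.id}")
        seen_ids.add(tc.id)
        if not tc.expected_tool_calls:
            problems.append(f"{tc.id}: no expected tool calls defined")
        # A topic that is also forbidden makes the case unsatisfiable
        overlap = {t.lower() for t in tc.expected_topics} & {
            f.lower() for f in tc.should_not_contain
        }
        if overlap:
            problems.append(f"{tc.id}: topics both expected and forbidden: {sorted(overlap)}")
    return problems
```

Running this once at import time (or as a tiny pytest) catches copy-paste mistakes before they waste an evaluation run.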

Run Agent and Collect Outputs

# evaluations/01_generate_test_data.py
"""
Stage 1: Run the agent against all test cases and save actual outputs.
Run this script whenever you want to refresh the evaluation dataset.
"""
import asyncio
import json
from pathlib import Path
from langchain_core.messages import HumanMessage, AIMessage, ToolMessage
from dotenv import load_dotenv
from agent.graph import build_graph
from evaluations.test_cases import TEST_CASES

load_dotenv()
OUTPUT_DIR = Path("evaluations/outputs")

async def collect_agent_output(app, test_case) -> dict:
    """Run one test case and collect structured output."""
    config = {"configurable": {"thread_id": f"eval-{test_case.id}"}}
    final_state = await app.ainvoke(
        {"messages": [HumanMessage(content=test_case.user_input)]},
        config=config,
    )
    messages = final_state["messages"]

    # Extract the tool calls that were actually made
    actual_tool_calls = []
    for msg in messages:
        if isinstance(msg, AIMessage) and msg.tool_calls:
            actual_tool_calls.extend(c["name"] for c in msg.tool_calls)

    # Extract tool results
    tool_results = [
        msg.content for msg in messages if isinstance(msg, ToolMessage)
    ]

    # Final response
    final_response = messages[-1].content if messages else ""

    return {
        "test_case_id": test_case.id,
        "user_input": test_case.user_input,
        "final_response": final_response,
        "actual_tool_calls": actual_tool_calls,
        "tool_results": tool_results,
        "message_count": len(messages),
    }


async def main():
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    app = build_graph(db_path=":memory:")   # use in-memory DB for eval runs

    results = []
    for tc in TEST_CASES:
        print(f"Running [{tc.id}]: {tc.user_input[:60]}...")
        output = await collect_agent_output(app, tc)
        results.append(output)

        # Save individual output for inspection
        out_file = OUTPUT_DIR / f"{tc.id}.json"
        out_file.write_text(json.dumps(output, indent=2))

    # Save combined dataset
    dataset_file = OUTPUT_DIR / "dataset.json"
    dataset_file.write_text(json.dumps(results, indent=2))
    print(f"\n✅ Saved {len(results)} outputs to {OUTPUT_DIR}/")


if __name__ == "__main__":
    asyncio.run(main())

Stage 2: Rule-Based Evaluation

Fast, deterministic checks that run in milliseconds. These catch obvious failures: wrong tools called, required topics missing, forbidden content present.

# evaluations/02_rule_based_eval.py
"""
Stage 2: Rule-based assertions using pytest.
Run: pytest evaluations/02_rule_based_eval.py -v
"""
import json
import pytest
from pathlib import Path
from evaluations.test_cases import TEST_CASES, AgentTestCase

OUTPUT_DIR = Path("evaluations/outputs")


def load_output(test_case_id: str) -> dict:
    output_file = OUTPUT_DIR / f"{test_case_id}.json"
    if not output_file.exists():
        pytest.skip(f"Output for {test_case_id} not found — run Stage 1 first.")
    return json.loads(output_file.read_text())


# Parametrize over all test cases
@pytest.mark.parametrize("tc", TEST_CASES, ids=lambda tc: tc.id)
class TestRuleBasedEval:

    def test_expected_tools_were_called(self, tc: AgentTestCase):
        """Every expected tool must appear in actual_tool_calls."""
        output = load_output(tc.id)
        actual = output["actual_tool_calls"]
        for expected_tool in tc.expected_tool_calls:
            assert expected_tool in actual, (
                f"Expected tool '{expected_tool}' was not called.\n"
                f"Actual calls: {actual}"
            )

    def test_response_not_empty(self, tc: AgentTestCase):
        """Final response must be non-empty."""
        output = load_output(tc.id)
        assert output["final_response"].strip(), "Final response is empty"

    def test_response_minimum_length(self, tc: AgentTestCase):
        """Response should be at least 20 characters (not just a status code)."""
        output = load_output(tc.id)
        assert len(output["final_response"]) >= 20, (
            f"Response too short: '{output['final_response']}'"
        )

    def test_expected_topics_present(self, tc: AgentTestCase):
        """All expected topics should appear in the final response (case-insensitive)."""
        output = load_output(tc.id)
        response_lower = output["final_response"].lower()
        missing = [t for t in tc.expected_topics if t.lower() not in response_lower]
        assert not missing, (
            f"Missing expected topics: {missing}\n"
            f"Response: {output['final_response'][:200]}"
        )

    def test_forbidden_content_absent(self, tc: AgentTestCase):
        """No forbidden phrases should appear in the response."""
        if not tc.should_not_contain:
            pytest.skip("No forbidden content defined for this test case")
        output = load_output(tc.id)
        response_lower = output["final_response"].lower()
        found = [f for f in tc.should_not_contain if f.lower() in response_lower]
        assert not found, (
            f"Forbidden content found: {found}\n"
            f"Response: {output['final_response'][:200]}"
        )

    def test_no_extra_tool_calls(self, tc: AgentTestCase):
        """Agent should not call more tools than expected (efficiency check)."""
        output = load_output(tc.id)
        actual_count = len(output["actual_tool_calls"])
        expected_count = len(tc.expected_tool_calls)
        # Allow up to 1 extra call (e.g., for retries), but flag excessive calls
        assert actual_count <= expected_count + 1, (
            f"Too many tool calls: expected {expected_count}, got {actual_count}\n"
            f"Calls made: {output['actual_tool_calls']}"
        )

Stage 3: LLM-as-Judge with DeepEval GEval

Rule-based checks miss nuanced quality issues: a response can contain the right keywords but still be poorly reasoned, incomplete, or unhelpful. For this, we use an LLM to judge the response against a detailed rubric.

What is GEval?

GEval is DeepEval's implementation of the G-Eval technique: you define custom evaluation criteria, and a judge LLM scores each output against them with step-by-step reasoning. Each metric has:

  • A name — what you're measuring
  • Evaluation steps — the rubric the judge LLM follows
  • Evaluation params — which test-case fields (input, actual output, retrieval context) the judge sees
  • A threshold — the minimum score for the metric to count as passing

The weights below are our own layer on top of DeepEval: we combine the per-metric scores into one overall number in our pipeline code.

Define Custom Metrics

# evaluations/metrics.py
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# Metric 1: Task Action Accuracy
# Does the response correctly reflect what the tools actually did?
task_accuracy = GEval(
    name="Task Action Accuracy",
    evaluation_steps=[
        "Read the user's request and the agent's final response carefully.",
        "Check if the response accurately reflects the actions taken (e.g., task created, list shown).",
        "Verify that task IDs, titles, priorities, and statuses mentioned in the response match the tool results.",
        "Penalize heavily if the response claims an action was performed that was not, or vice versa.",
        "Score 1.0 if fully accurate, 0.5 if partially accurate, 0.0 if inaccurate.",
    ],
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.RETRIEVAL_CONTEXT,  # tool results as context
    ],
    threshold=0.7,
)

# Metric 2: Response Completeness
# Does the response answer all parts of the user's request?
completeness = GEval(
    name="Response Completeness",
    evaluation_steps=[
        "List all distinct requests made by the user in their message.",
        "For each request, determine whether the response addresses it.",
        "A response is complete if it addresses all requests and provides necessary details.",
        "Partial credit if most but not all requests are addressed.",
        "Score 1.0 if fully complete, 0.0 if the main request is unanswered.",
    ],
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
    threshold=0.7,
)

# Metric 3: Conciseness
# Is the response appropriately brief without losing key information?
conciseness = GEval(
    name="Conciseness",
    evaluation_steps=[
        "Assess whether the response contains unnecessary repetition or filler text.",
        "Good responses confirm actions clearly without restating the user's input verbatim.",
        "Penalize responses that are verbose or padded; reward tight, informative ones.",
        "A one-sentence confirmation for a simple action is ideal.",
        "Score 1.0 if concise, 0.5 if somewhat verbose, 0.0 if excessively padded.",
    ],
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
    threshold=0.6,
)

# Weighted combination
WEIGHTED_METRICS = [
    (task_accuracy, 0.50),   # most important: did it do the right thing?
    (completeness,  0.30),   # did it address everything?
    (conciseness,   0.20),   # was it well-phrased?
]
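As a sanity check on the weighting, the overall score can be computed by hand. For example, per-metric scores of 0.9, 0.8, and 0.5 (values chosen for illustration) give:

```python
# Hypothetical per-metric scores for one test case
scores  = {"task_accuracy": 0.9, "completeness": 0.8, "conciseness": 0.5}
weights = {"task_accuracy": 0.50, "completeness": 0.30, "conciseness": 0.20}

# Weighted average: (0.45 + 0.24 + 0.10) / 1.0
overall = sum(scores[k] * weights[k] for k in scores) / sum(weights.values())
print(round(overall, 3))  # → 0.79
```

With a 0.7 passing threshold, this case passes even though conciseness alone would have failed — which is exactly the point of weighting accuracy most heavily.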

Run the LLM Evaluation

# evaluations/03_llm_eval.py
"""
Stage 3: LLM-as-judge evaluation using DeepEval GEval.
Run: python evaluations/03_llm_eval.py
"""
import json
from pathlib import Path
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from evaluations.test_cases import TEST_CASES
from evaluations.metrics import WEIGHTED_METRICS, task_accuracy, completeness, conciseness

OUTPUT_DIR = Path("evaluations/outputs")


def load_output(test_case_id: str) -> dict | None:
    output_file = OUTPUT_DIR / f"{test_case_id}.json"
    if not output_file.exists():
        return None
    return json.loads(output_file.read_text())


def build_llm_test_case(tc, output: dict) -> LLMTestCase:
    """Map our AgentTestCase + agent output to DeepEval's LLMTestCase."""
    return LLMTestCase(
        input=tc.user_input,
        actual_output=output["final_response"],
        retrieval_context=output["tool_results"],   # tool results inform accuracy check
        expected_output=f"Should mention: {', '.join(tc.expected_topics)}",
    )


def compute_weighted_score(test_case: LLMTestCase) -> float:
    """Compute a weighted overall score across all metrics."""
    total_weight = 0.0
    total_score  = 0.0

    for metric, weight in WEIGHTED_METRICS:
        metric.measure(test_case)
        total_score  += metric.score * weight
        total_weight += weight

    return total_score / total_weight if total_weight > 0 else 0.0


def main():
    llm_test_cases = []
    for tc in TEST_CASES:
        output = load_output(tc.id)
        if output is None:
            print(f"⚠️  Skipping {tc.id} — run Stage 1 first.")
            continue
        llm_test_cases.append((tc, build_llm_test_case(tc, output)))

    if not llm_test_cases:
        print("No test cases to evaluate. Run Stage 1 first.")
        return

    print(f"\nEvaluating {len(llm_test_cases)} test cases with LLM judge...\n")

    results = []
    for tc, ltc in llm_test_cases:
        weighted_score = compute_weighted_score(ltc)
        passed = weighted_score >= 0.7   # overall passing threshold

        results.append({
            "id": tc.id,
            "input": tc.user_input[:60],
            "weighted_score": round(weighted_score, 3),
            "passed": passed,
        })

        status = "✅ PASS" if passed else "❌ FAIL"
        print(f"[{status}] {tc.id}: {weighted_score:.3f}")

    # ── Run DeepEval's built-in evaluate() for the full report ───────────────
    # Note: evaluate() re-runs every metric, so this doubles the judge LLM calls;
    # drop it if you only need the weighted scores computed above.
    all_ltc = [ltc for _, ltc in llm_test_cases]
    all_metrics = [task_accuracy, completeness, conciseness]
    evaluate(test_cases=all_ltc, metrics=all_metrics)

    # ── Print summary ──────────────────────────────────────────────────────────
    passed_count = sum(1 for r in results if r["passed"])
    avg_score = sum(r["weighted_score"] for r in results) / len(results)
    print(f"\n{'='*50}")
    print(f"Results: {passed_count}/{len(results)} passed")
    print(f"Average weighted score: {avg_score:.3f}")
    print(f"{'='*50}")

    # Save results for trend tracking
    results_file = OUTPUT_DIR / "llm_eval_results.json"
    results_file.write_text(json.dumps(results, indent=2))
    print(f"\nDetailed results saved to {results_file}")


if __name__ == "__main__":
    main()

Running the Full Pipeline

# Step 1: Generate test data (run the agent on all golden inputs)
python evaluations/01_generate_test_data.py

# Step 2: Rule-based checks (fast, < 1 second)
pytest evaluations/02_rule_based_eval.py -v

# Step 3: LLM-as-judge (slower, ~30-60 seconds depending on case count)
python evaluations/03_llm_eval.py
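If you prefer a single entry point, a small wrapper can run the three stages in order and stop at the first failure. A convenience sketch (`run_pipeline` is our own helper; it assumes the file layout above):

```python
import subprocess
import sys

# The three stages, in order; each is a command run as a subprocess
STAGES = [
    [sys.executable, "evaluations/01_generate_test_data.py"],
    [sys.executable, "-m", "pytest", "evaluations/02_rule_based_eval.py", "-v"],
    [sys.executable, "evaluations/03_llm_eval.py"],
]

def run_pipeline(stages=STAGES) -> int:
    """Run each stage; stop and return the exit code of the first failure."""
    for cmd in stages:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            print(f"Stage failed: {' '.join(cmd)}")
            return result.returncode
    return 0

if __name__ == "__main__":
    sys.exit(run_pipeline())
```

Failing fast here saves money: there is no point paying for LLM judging when the rule-based stage already shows the wrong tools were called.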

Sample output from Stage 2:

PASSED  evaluations/02_rule_based_eval.py::TestRuleBasedEval::test_expected_tools_were_called[tc-001-create-high-priority]
PASSED  evaluations/02_rule_based_eval.py::TestRuleBasedEval::test_expected_topics_present[tc-001-create-high-priority]
FAILED  evaluations/02_rule_based_eval.py::TestRuleBasedEval::test_expected_topics_present[tc-005-error-handling]
AssertionError: Missing expected topics: ['not found']

Sample output from Stage 3:

[✅ PASS] tc-001-create-high-priority: 0.891
[✅ PASS] tc-002-list-filtered: 0.843
[✅ PASS] tc-003-policy-lookup: 0.776
[✅ PASS] tc-004-multi-step: 0.801
[❌ FAIL] tc-005-error-handling: 0.623

==================================================
Results: 4/5 passed
Average weighted score: 0.787
==================================================

When to Run Evaluations

| Trigger | Stage to Run | Why |
|---|---|---|
| After any system prompt change | Stage 2 + 3 | Prompt changes silently affect behavior |
| After adding/modifying a tool | Stage 2 | Check tool call patterns |
| After upgrading the LLM version | All 3 | Model updates change behavior |
| Before every production release | All 3 | Full regression check |
| Weekly (scheduled) | All 3 | Catch model drift from provider updates |
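Fig 1 also promises regression detection, and the saved llm_eval_results.json makes that straightforward: keep a copy of a known-good run as a baseline and diff new runs against it. A minimal sketch (`detect_regressions` and the 0.05 tolerance are our own choices):

```python
import json
from pathlib import Path

def detect_regressions(baseline_file, current_file, tolerance=0.05):
    """Compare two llm_eval_results.json files; flag per-case score drops."""
    baseline = {r["id"]: r["weighted_score"]
                for r in json.loads(Path(baseline_file).read_text())}
    current = {r["id"]: r["weighted_score"]
               for r in json.loads(Path(current_file).read_text())}
    regressions = []
    for case_id, new_score in current.items():
        old_score = baseline.get(case_id)
        # Only flag drops beyond the tolerance; small jitter is expected from the judge
        if old_score is not None and old_score - new_score > tolerance:
            regressions.append((case_id, old_score, new_score))
    return regressions
```

Wiring this into the weekly scheduled run turns "the prompt change felt fine" into a concrete list of test cases that got worse.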


💡 Ollama note: DeepEval’s GEval by default uses OpenAI as the judge LLM. To use a local model as the judge, configure it with:

from deepeval.models import DeepEvalBaseLLM
# Implement a custom judge class wrapping OllamaLLM
# See: https://docs.confident-ai.com/docs/metrics-custom-judge

Note: smaller models (< 13B parameters) often produce inconsistent rubric scores. For reliable LLM judging, use at least GPT-4o-mini or an equivalent capable model.


Summary

| Stage | Method | Speed | What It Catches |
|---|---|---|---|
| 1 | Agent execution on golden inputs | ~5–30s | Nothing — generates data |
| 2 | pytest rule-based assertions | < 1s | Wrong tools, missing topics, forbidden content |
| 3 | DeepEval GEval LLM-as-judge | ~30–60s | Nuanced quality, accuracy, completeness |

Series Complete

You’ve now built a production-grade AI agent from scratch:

| Chapter | What You Built |
|---|---|
| Ch 1 | Mental model: agent loop, ReAct, when to use agents |
| Ch 2 | Context window anatomy, token budget management |
| Ch 3 | LangChain + LangGraph fundamentals |
| Ch 4 | Full multi-turn agent with SQLite persistence and streaming |
| Ch 5 | AgentContext injection, NeMo Guardrails, HITL interrupts |
| Ch 6 | Four-layer memory: trim, MongoDB, FAISS RAG, system prompt |
| Ch 7 | Full Langfuse observability: traces, spans, cost tracking |
| Ch 8 | Three-stage evaluation pipeline with DeepEval GEval |

The full source code for this series is organized as a single Python package. Combine these pieces and you have a system that is observable, safe, memory-aware, and measurably correct.


← Ch 7: Tracing with Langfuse | ↑ Series Overview