[Ch 5] RAG Integration — Giving Your Agent Real Knowledge

Apr 8, 2026 · 10 min read

In Ch 4 we built a working QA agent with two stub tools that returned placeholder strings. In this chapter we replace those stubs with real implementations: search_docs retrieves relevant excerpts from project documentation using FAISS vector search, and generate_test_cases uses those retrieved excerpts to generate grounded, contextual test cases.


How RAG Works in This Agent

graph TD
    subgraph "Offline: Document Ingestion"
        D["📄 Project Docs<br/>(Markdown files)"] --> C["Chunk & Embed<br/>(OpenAI Embeddings)"]
        C --> F["💾 FAISS Index<br/>(saved to disk)"]
    end
    subgraph "Online: Agent Query"
        Q["User Question"] --> S["search_docs(query, k=3)"]
        S --> F
        F --> R["Top-k Relevant Chunks"]
        R --> G["generate_test_cases(feature, context)"]
        G --> LLM["LLM generates test cases<br/>grounded in actual docs"]
    end
    style D fill:#2563EB,color:#fff,stroke:none
    style F fill:#7C3AED,color:#fff,stroke:none
    style S fill:#059669,color:#fff,stroke:none
    style LLM fill:#D97706,color:#fff,stroke:none
Fig 1: Offline ingestion builds the FAISS index; online queries retrieve relevant chunks and ground LLM outputs
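At query time, the index ranks stored chunks by vector similarity to the query embedding. The core ranking idea can be sketched with cosine similarity over toy 3-dimensional vectors (a simplified illustration, not how FAISS is implemented internally; the real index uses 1536-dimensional OpenAI embeddings):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity: dot product divided by the product of magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy "embeddings" standing in for real 1536-dim vectors.
index = {
    "Accounts lock after 5 failed login attempts.": [0.9, 0.1, 0.0],
    "File uploads are limited to 50MB per file.":   [0.1, 0.9, 0.1],
    "Session tokens expire after 8 hours.":         [0.8, 0.2, 0.1],
}

def top_k(query_vec: list[float], k: int = 2) -> list[str]:
    # Rank every stored chunk against the query vector, keep the best k.
    ranked = sorted(index.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# A query vector "close to" the login-related chunks retrieves both of them.
print(top_k([1.0, 0.0, 0.1], k=2))
```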

Project Layout at This Stage

qa-agent/
├── .env.example
├── requirements.txt
├── docs/                 ← NEW: project documentation to index
│   ├── requirements.md
│   ├── api-spec.md
│   └── user-stories.md
├── agent/
│   ├── state.py
│   ├── tools/            ← EVOLVED: tools.py split into a package
│   │   ├── __init__.py
│   │   ├── retrieval.py  ← NEW: search_docs() via FAISS
│   │   └── testgen.py    ← NEW: generate_test_cases() via LLM
│   ├── nodes.py
│   ├── graph.py
│   └── main.py
└── scripts/
    └── ingest_docs.py    ← NEW: embed + index project docs

Installation

pip install langchain-core langchain-openai langgraph faiss-cpu langchain-community langchain-text-splitters python-dotenv

# .env.example
OPENAI_API_KEY=your-api-key-here
FAISS_INDEX_PATH=faiss_index

💡 Ollama: For embeddings, swap OpenAIEmbeddings (in both ingest_docs.py and retrieval.py; the same model must embed the documents and the queries) with:

from langchain_ollama import OllamaEmbeddings
embeddings = OllamaEmbeddings(model="nomic-embed-text")

For test case generation in testgen.py, swap ChatOpenAI with ChatOllama.


Step 1 — Create the Project Documentation

These are fictional markdown files representing a software project. In a real scenario, these would be your actual project docs, API specs, and user stories.

<!-- docs/requirements.md -->
# System Requirements

## Authentication & Login
- Users must authenticate via email and password.
- Passwords must be at least 12 characters long and contain uppercase, lowercase, digit, and special character.
- Accounts lock after 5 consecutive failed login attempts. Lockout duration: 15 minutes.
- Multi-factor authentication (MFA) is required for admin accounts.
- Session tokens expire after 8 hours of inactivity.

## Performance
- API endpoints must respond within 200ms at the 95th percentile under normal load.
- The system must support 500 concurrent users without degradation.
- File uploads are limited to 50MB per file.
- Background jobs must complete within 30 minutes.

## Data Retention
- User data is retained for 7 years per compliance requirements.
- Soft-delete is used for all user-facing records; hard delete requires admin approval.

<!-- docs/api-spec.md -->
# API Specification

## Payment API

### POST /api/v1/payments
Creates a new payment transaction.

**Request body:**
- `amount` (integer, required): Amount in cents. Must be between 1 and 1,000,000.
- `currency` (string, required): ISO 4217 currency code (e.g., "USD", "EUR").
- `payment_method_id` (string, required): Tokenized payment method from client SDK.
- `idempotency_key` (string, required): UUID, prevents duplicate charges.

**Responses:**
- `201 Created`: Payment initiated. Returns `{ "transaction_id": "...", "status": "pending" }`.
- `400 Bad Request`: Invalid amount, currency, or missing required fields.
- `402 Payment Required`: Payment method declined.
- `409 Conflict`: Duplicate idempotency key.
- `422 Unprocessable Entity`: Amount exceeds maximum.

### GET /api/v1/payments/{transaction_id}
Retrieves payment status.

## File Upload API

### POST /api/v1/uploads
- Max file size: 50MB.
- Accepted MIME types: `image/jpeg`, `image/png`, `application/pdf`.
- Returns a pre-signed URL valid for 24 hours.
- Virus scanning runs asynchronously; status transitions to `ready` or `rejected`.

<!-- docs/user-stories.md -->
# User Stories

## Admin Panel

**US-001: User Management**
As an admin, I want to view all registered users with their status and last login date,
so that I can identify inactive or suspicious accounts.
Acceptance criteria:
- Display: username, email, role, status (active/suspended), last login
- Filterable by status and role
- Exportable to CSV

**US-002: Audit Log**
As an admin, I want to view a chronological audit log of all privileged actions,
so that I can investigate incidents.
Acceptance criteria:
- Log entries: actor, action, timestamp, affected resource, IP address
- Searchable by actor and date range
- Immutable once written

## End User

**US-003: Password Reset**
As a user, I want to reset my password via email so that I can regain access if I forget it.
Acceptance criteria:
- Reset link expires after 1 hour
- Link is single-use
- User must enter new password twice to confirm
- Previous password cannot be reused

**US-004: File Upload**
As a user, I want to upload documents up to 50MB in size,
so that I can attach supporting materials to my submissions.
Acceptance criteria:
- Progress indicator during upload
- Error message if file exceeds 50MB or is an unsupported format
- Uploaded file appears in "My Documents" within 30 seconds

Step 2 — Ingest the Documents

The ingestion script walks the docs/ directory, splits each file into overlapping chunks, embeds them with OpenAI, and saves the FAISS index to disk.

# scripts/ingest_docs.py
"""
Embed and index all markdown files in the docs/ directory.
Run once before starting the agent:
    python -m scripts.ingest_docs
"""
import os
from pathlib import Path
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from dotenv import load_dotenv

load_dotenv()

DOCS_DIR    = Path("docs")
INDEX_PATH  = os.environ.get("FAISS_INDEX_PATH", "faiss_index")
CHUNK_SIZE  = 500        # characters per chunk
CHUNK_OVERLAP = 100      # overlap to avoid cutting mid-sentence


def ingest() -> None:
    docs = []
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=CHUNK_SIZE,
        chunk_overlap=CHUNK_OVERLAP,
        separators=["\n\n", "\n", " "],
    )

    for md_file in sorted(DOCS_DIR.glob("*.md")):
        print(f"Loading {md_file.name}...")
        loader = TextLoader(str(md_file), encoding="utf-8")
        raw_docs = loader.load()
        # Attach the source filename as metadata
        for doc in raw_docs:
            doc.metadata["source"] = md_file.name
        chunks = splitter.split_documents(raw_docs)
        docs.extend(chunks)
        print(f"  → {len(chunks)} chunks")

    print(f"\nEmbedding {len(docs)} chunks with OpenAIEmbeddings...")
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    vectorstore = FAISS.from_documents(docs, embeddings)

    vectorstore.save_local(INDEX_PATH)
    print(f"✅ FAISS index saved to '{INDEX_PATH}/'")


if __name__ == "__main__":
    ingest()

Run this once before starting the agent:

python -m scripts.ingest_docs
# Loading requirements.md...
#   → 8 chunks
# Loading api-spec.md...
#   → 12 chunks
# Loading user-stories.md...
#   → 9 chunks
# Embedding 29 chunks with OpenAIEmbeddings...
# ✅ FAISS index saved to 'faiss_index/'
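The chunk counts above come from the chunk_size=500 / chunk_overlap=100 settings. The fixed-stride idea behind overlapping chunks can be sketched in plain Python (a naive version; RecursiveCharacterTextSplitter additionally prefers to break on the configured separators rather than mid-word):

```python
def chunk(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    # Naive fixed-stride chunker: each chunk starts (size - overlap)
    # characters after the previous one, so neighbours share `overlap`
    # characters and nothing is lost at a hard boundary.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk("x" * 1200)
print(len(chunks), len(chunks[0]))           # 3 500
print(chunks[0][-100:] == chunks[1][:100])   # True (shared overlap)
```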

Step 3 — Implement search_docs

# agent/tools/retrieval.py
import os
from functools import lru_cache
from pydantic import BaseModel, Field
from langchain_core.tools import tool
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

INDEX_PATH = os.environ.get("FAISS_INDEX_PATH", "faiss_index")


@lru_cache(maxsize=1)
def _load_vectorstore() -> FAISS:
    """Load FAISS index once and cache it (avoids reloading on every tool call)."""
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    return FAISS.load_local(
        INDEX_PATH,
        embeddings,
        allow_dangerous_deserialization=True,   # safe: we control the index
    )


class SearchDocsInput(BaseModel):
    query: str = Field(description="The question or topic to search for in project documentation")
    k: int = Field(default=3, description="Number of relevant document chunks to retrieve")


@tool("search_docs", args_schema=SearchDocsInput)
def search_docs(query: str, k: int = 3) -> str:
    """Search project documentation and return relevant excerpts.

    Performs vector similarity search against the embedded FAISS index.
    Returns the top-k most relevant chunks with their source file.
    """
    vectorstore = _load_vectorstore()
    results = vectorstore.similarity_search(query, k=k)

    if not results:
        return "No relevant documentation found for this query."

    formatted = []
    for i, doc in enumerate(results, 1):
        source = doc.metadata.get("source", "unknown")
        formatted.append(f"[{i}] Source: {source}\n{doc.page_content.strip()}")

    return "\n\n---\n\n".join(formatted)

What lru_cache does here: The FAISS index is loaded from disk on the first call and cached in memory for all subsequent calls. Without this, every tool invocation would reload the index from disk — expensive for a large index.
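The caching behaviour is easy to demonstrate with a counter and a stub in place of the FAISS load (a toy sketch; load_index stands in for _load_vectorstore):

```python
from functools import lru_cache

LOADS = 0

@lru_cache(maxsize=1)
def load_index() -> str:
    # Stand-in for FAISS.load_local(): expensive disk I/O we want exactly once.
    global LOADS
    LOADS += 1
    return "index"

for _ in range(5):
    load_index()

print(LOADS)  # 1: the function body ran once; later calls hit the cache
```

Because the function takes no arguments, maxsize=1 holds its single result for the life of the process; re-running the ingestion script requires a restart (or `load_index.cache_clear()`) to pick up a new index.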


Step 4 — Implement generate_test_cases

# agent/tools/testgen.py
import os
from pydantic import BaseModel, Field
from langchain_core.tools import tool
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_openai import ChatOpenAI
from .retrieval import search_docs

TESTGEN_SYSTEM_PROMPT = """You are a senior QA engineer. Given a feature description and relevant 
documentation excerpts, generate a concise list of test cases.

Format each test case as:
TC-XXX: [Test Case Title]
  Given: [precondition]
  When: [action]
  Then: [expected result]

Focus on: happy path, edge cases, error cases, and boundary conditions."""


class GenerateTestCasesInput(BaseModel):
    feature: str = Field(
        description="The feature or component to generate test cases for (e.g., 'payment API', 'login flow')"
    )


@tool("generate_test_cases", args_schema=GenerateTestCasesInput)
def generate_test_cases(feature: str) -> str:
    """Generate test cases for a feature using retrieved project documentation as context.

    Automatically retrieves relevant documentation chunks via search_docs,
    then uses an LLM to generate grounded test cases.
    """
    # First, retrieve relevant context from the documentation
    context = search_docs.invoke({"query": feature, "k": 4})

    llm = ChatOpenAI(
        model="gpt-4o-mini",
        temperature=0.2,
        api_key=os.environ.get("OPENAI_API_KEY"),
    )

    messages = [
        SystemMessage(content=TESTGEN_SYSTEM_PROMPT),
        HumanMessage(content=f"Feature: {feature}\n\nDocumentation context:\n{context}"),
    ]

    response = llm.invoke(messages)
    return response.content

Why call search_docs inside generate_test_cases? This is a tool composition pattern: generate_test_cases calls search_docs internally to retrieve context before asking the LLM to generate. The agent doesn’t need to chain these calls explicitly — it just calls generate_test_cases and gets fully-contextualized results back.
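The pattern can be sketched with plain functions and a hypothetical two-entry corpus standing in for the FAISS index and the LLM call (both names and data here are illustrative only):

```python
# Hypothetical mini-corpus; the real tool searches the FAISS index instead.
CORPUS = {
    "login": "Accounts lock after 5 consecutive failed login attempts.",
    "upload": "File uploads are limited to 50MB per file.",
}

def search_docs(query: str) -> str:
    # Simplified retrieval: keyword match instead of vector similarity search.
    hits = [text for key, text in CORPUS.items() if key in query.lower()]
    return "\n".join(hits) or "No relevant documentation found."

def generate_test_cases(feature: str) -> str:
    # Composition: retrieve context first, then "generate" (a template here,
    # an LLM call in the real tool). The caller makes a single call.
    context = search_docs(feature)
    return f"Test cases for '{feature}' grounded in:\n{context}"

print(generate_test_cases("login flow"))
```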


Step 5 — Wire the Tools Package

# agent/tools/__init__.py
from .retrieval import search_docs
from .testgen import generate_test_cases

TOOLS = [search_docs, generate_test_cases]

Because tools.py became a tools/ package that exports the same TOOLS name, the existing imports in nodes.py and graph.py keep working as-is:

# agent/nodes.py and agent/graph.py  (import unchanged)
from .tools import TOOLS   # now resolves to the package's __init__.py

No changes needed to state.py, the graph.py logic, or main.py — the agent loop is identical. Only the tool implementations change.


Running the Agent

With the FAISS index ingested, start the agent:

# agent/main.py (unchanged from Ch 4)
from dotenv import load_dotenv
from langchain_core.messages import HumanMessage
from .graph import build_graph

load_dotenv()


def chat(app, thread_id: str, user_input: str) -> None:
    config = {"configurable": {"thread_id": thread_id}}
    inputs = {"messages": [HumanMessage(content=user_input)]}

    print(f"\nUser: {user_input}")
    print("Assistant: ", end="", flush=True)
    for chunk, metadata in app.stream(inputs, config=config, stream_mode="messages"):
        if (
            chunk.content
            and metadata.get("langgraph_node") == "agent"
            and not getattr(chunk, "tool_calls", None)
        ):
            print(chunk.content, end="", flush=True)
    print()


def main():
    app = build_graph()
    thread = "qa-demo"

    chat(app, thread, "What are the login requirements?")
    chat(app, thread, "Generate test cases for the login flow")
    chat(app, thread, "What does the payment API return when a payment is declined?")
    chat(app, thread, "Generate edge case tests for the payment API")


if __name__ == "__main__":
    main()

Expected output (abbreviated):

User: What are the login requirements?
Assistant: Based on the project documentation, here are the login requirements:
- Users authenticate via email and password
- Passwords must be at least 12 characters with uppercase, lowercase, digit, and special character
- Accounts lock after 5 failed attempts (15-minute lockout)
- MFA is required for admin accounts
- Sessions expire after 8 hours of inactivity

User: Generate test cases for the login flow
Assistant: Here are test cases for the login flow:

TC-001: Successful Login with Valid Credentials
  Given: A registered user with a valid email and compliant password
  When: User submits correct credentials
  Then: Session token is issued and user is redirected to dashboard

TC-002: Account Lockout After 5 Failed Attempts
  Given: A registered user account
  When: User enters incorrect password 5 consecutive times
  Then: Account is locked for 15 minutes and lockout message is displayed

TC-003: Password Complexity Validation
  Given: A new user attempting to set a password
  When: Password is submitted without a special character
  Then: Error message shown listing all complexity requirements

TC-004: MFA Required for Admin Accounts
  Given: An admin-role user
  When: User authenticates with email/password
  Then: MFA challenge is presented before access is granted

TC-005: Session Expiry After Inactivity
  Given: A logged-in user session
  When: User is inactive for 8+ hours
  Then: Session is invalidated and user is redirected to login

User: What does the payment API return when a payment is declined?
Assistant: When a payment is declined, the payment API returns:
- Status: 402 Payment Required
- This indicates the payment method was declined by the payment processor

Key Design Points

| Decision | Rationale |
| --- | --- |
| lru_cache on FAISS load | Avoids reloading the index on every tool call in a single process |
| chunk_size=500, overlap=100 | Balances retrieval precision vs. context completeness |
| text-embedding-3-small | 1536-dim embeddings, strong quality/cost tradeoff for English docs |
| Tool composition in generate_test_cases | Agent calls one tool; retrieval happens internally |
| Separate ingest_docs.py script | Ingestion is a one-time offline step, not part of the agent loop |

Summary

| Component | What Changed from Ch 4 |
| --- | --- |
| agent/tools/retrieval.py | search_docs now does real FAISS similarity search |
| agent/tools/testgen.py | generate_test_cases retrieves context then calls LLM |
| agent/tools/__init__.py | New package entry point |
| scripts/ingest_docs.py | New: offline doc ingestion script |
| docs/*.md | New: project documentation files |
| agent/nodes.py, graph.py, main.py | Unchanged — agent loop is identical |

In the next chapter, we add production safety: user context injection via AgentContext, content moderation with NeMo Guardrails, and human-in-the-loop interrupts for high-stakes decisions.


← Ch 4: Build Your First Agent | Ch 6: Tools, Guardrails & Safety →