[Ch 5] RAG Integration — Giving Your Agent Real Knowledge

In Ch 4 we built a working QA agent with two stub tools that returned placeholder strings. In this chapter we replace those stubs with real implementations: search_docs retrieves relevant excerpts from project documentation using FAISS vector search, and generate_test_cases uses those retrieved excerpts to generate grounded, contextual test cases.
How RAG Works in This Agent
[Diagram: two phases. Offline ingestion: markdown files in docs/ are chunked and embedded with OpenAI Embeddings, then saved to disk as a FAISS index. Online agent query: a user question goes to search_docs(query, k=3), which pulls the top-k relevant chunks from the FAISS index; generate_test_cases(feature, context) then passes those chunks to the LLM, which generates test cases grounded in the actual docs.]
Project Layout at This Stage
qa-agent/
├── .env.example
├── requirements.txt
├── docs/ ← NEW: project documentation to index
│ ├── requirements.md
│ ├── api-spec.md
│ └── user-stories.md
├── agent/
│ ├── state.py
│ ├── tools/ ← EVOLVED: tools.py split into a package
│ │ ├── __init__.py
│ │ ├── retrieval.py ← NEW: search_docs() via FAISS
│ │ └── testgen.py ← NEW: generate_test_cases() via LLM
│ ├── nodes.py
│ ├── graph.py
│ └── main.py
└── scripts/
└── ingest_docs.py ← NEW: embed + index project docs
Installation
pip install langchain-core langchain-openai langgraph faiss-cpu langchain-community langchain-text-splitters python-dotenv
# .env.example
OPENAI_API_KEY=your-api-key-here
FAISS_INDEX_PATH=faiss_index
💡 Ollama: For embeddings, swap `OpenAIEmbeddings` with:
from langchain_ollama import OllamaEmbeddings
embeddings = OllamaEmbeddings(model="nomic-embed-text")
For test case generation in testgen.py, swap `ChatOpenAI` with `ChatOllama`.
Step 1 — Create the Project Documentation
These are fictional markdown files representing a software project. In a real scenario, these would be your actual project docs, API specs, and user stories.
<!-- docs/requirements.md -->
# System Requirements
## Authentication & Login
- Users must authenticate via email and password.
- Passwords must be at least 12 characters long and contain uppercase, lowercase, digit, and special character.
- Accounts lock after 5 consecutive failed login attempts. Lockout duration: 15 minutes.
- Multi-factor authentication (MFA) is required for admin accounts.
- Session tokens expire after 8 hours of inactivity.
## Performance
- API endpoints must respond within 200ms at the 95th percentile under normal load.
- The system must support 500 concurrent users without degradation.
- File uploads are limited to 50MB per file.
- Background jobs must complete within 30 minutes.
## Data Retention
- User data is retained for 7 years per compliance requirements.
- Soft-delete is used for all user-facing records; hard delete requires admin approval.
<!-- docs/api-spec.md -->
# API Specification
## Payment API
### POST /api/v1/payments
Creates a new payment transaction.
**Request body:**
- `amount` (integer, required): Amount in cents. Must be between 1 and 1,000,000.
- `currency` (string, required): ISO 4217 currency code (e.g., "USD", "EUR").
- `payment_method_id` (string, required): Tokenized payment method from client SDK.
- `idempotency_key` (string, required): UUID, prevents duplicate charges.
**Responses:**
- `201 Created`: Payment initiated. Returns `{ "transaction_id": "...", "status": "pending" }`.
- `400 Bad Request`: Invalid amount, currency, or missing required fields.
- `402 Payment Required`: Payment method declined.
- `409 Conflict`: Duplicate idempotency key.
- `422 Unprocessable Entity`: Amount exceeds maximum.
### GET /api/v1/payments/{transaction_id}
Retrieves payment status.
## File Upload API
### POST /api/v1/uploads
- Max file size: 50MB.
- Accepted MIME types: `image/jpeg`, `image/png`, `application/pdf`.
- Returns a pre-signed URL valid for 24 hours.
- Virus scanning runs asynchronously; status transitions to `ready` or `rejected`.
<!-- docs/user-stories.md -->
# User Stories
## Admin Panel
**US-001: User Management**
As an admin, I want to view all registered users with their status and last login date,
so that I can identify inactive or suspicious accounts.
Acceptance criteria:
- Display: username, email, role, status (active/suspended), last login
- Filterable by status and role
- Exportable to CSV
**US-002: Audit Log**
As an admin, I want to view a chronological audit log of all privileged actions,
so that I can investigate incidents.
Acceptance criteria:
- Log entries: actor, action, timestamp, affected resource, IP address
- Searchable by actor and date range
- Immutable once written
## End User
**US-003: Password Reset**
As a user, I want to reset my password via email so that I can regain access if I forget it.
Acceptance criteria:
- Reset link expires after 1 hour
- Link is single-use
- User must enter new password twice to confirm
- Previous password cannot be reused
**US-004: File Upload**
As a user, I want to upload documents up to 50MB in size,
so that I can attach supporting materials to my submissions.
Acceptance criteria:
- Progress indicator during upload
- Error message if file exceeds 50MB or is an unsupported format
- Uploaded file appears in "My Documents" within 30 seconds
Step 2 — Ingest the Documents
The ingestion script walks the docs/ directory, splits each file into overlapping chunks, embeds them with OpenAI, and saves the FAISS index to disk.
# scripts/ingest_docs.py
"""
Embed and index all markdown files in the docs/ directory.
Run once before starting the agent:
python -m scripts.ingest_docs
"""
import os
from pathlib import Path
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from dotenv import load_dotenv
load_dotenv()
DOCS_DIR = Path("docs")
INDEX_PATH = os.environ.get("FAISS_INDEX_PATH", "faiss_index")
CHUNK_SIZE = 500 # characters per chunk
CHUNK_OVERLAP = 100 # overlap to avoid cutting mid-sentence
def ingest() -> None:
docs = []
splitter = RecursiveCharacterTextSplitter(
chunk_size=CHUNK_SIZE,
chunk_overlap=CHUNK_OVERLAP,
separators=["\n\n", "\n", " "],
)
for md_file in sorted(DOCS_DIR.glob("*.md")):
print(f"Loading {md_file.name}...")
loader = TextLoader(str(md_file), encoding="utf-8")
raw_docs = loader.load()
# Attach the source filename as metadata
for doc in raw_docs:
doc.metadata["source"] = md_file.name
chunks = splitter.split_documents(raw_docs)
docs.extend(chunks)
print(f" → {len(chunks)} chunks")
print(f"\nEmbedding {len(docs)} chunks with OpenAIEmbeddings...")
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = FAISS.from_documents(docs, embeddings)
vectorstore.save_local(INDEX_PATH)
print(f"✅ FAISS index saved to '{INDEX_PATH}/'")
if __name__ == "__main__":
ingest()
Run this once before starting the agent:
python -m scripts.ingest_docs
# Loading requirements.md...
# → 8 chunks
# Loading api-spec.md...
# → 12 chunks
# Loading user-stories.md...
# → 9 chunks
# Embedding 29 chunks with OpenAIEmbeddings...
# ✅ FAISS index saved to 'faiss_index/'
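If you want intuition for how `chunk_size=500` and `chunk_overlap=100` interact, here is a simplified character-window sketch. The real `RecursiveCharacterTextSplitter` is smarter (it prefers the separators list over hard cuts), but the window arithmetic is the same idea:

```python
def naive_chunks(text: str, size: int, overlap: int) -> list[str]:
    """Slide a fixed window across the text; consecutive windows share `overlap` chars."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

pieces = naive_chunks("x" * 1200, size=500, overlap=100)
print(len(pieces))     # 3 windows: [0:500], [400:900], [800:1200]
print(len(pieces[0]))  # 500
```

The overlap means a sentence falling near a chunk boundary still appears whole in at least one chunk, which is exactly why `CHUNK_OVERLAP` exists in the ingestion script.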
Step 3 — Implement search_docs
# agent/tools/retrieval.py
import os
from functools import lru_cache
from pydantic import BaseModel, Field
from langchain_core.tools import tool
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
INDEX_PATH = os.environ.get("FAISS_INDEX_PATH", "faiss_index")
@lru_cache(maxsize=1)
def _load_vectorstore() -> FAISS:
"""Load FAISS index once and cache it (avoids reloading on every tool call)."""
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
return FAISS.load_local(
INDEX_PATH,
embeddings,
allow_dangerous_deserialization=True, # safe: we control the index
)
class SearchDocsInput(BaseModel):
query: str = Field(description="The question or topic to search for in project documentation")
k: int = Field(default=3, description="Number of relevant document chunks to retrieve")
@tool("search_docs", args_schema=SearchDocsInput)
def search_docs(query: str, k: int = 3) -> str:
"""Search project documentation and return relevant excerpts.
Performs vector similarity search against the embedded FAISS index.
Returns the top-k most relevant chunks with their source file.
"""
vectorstore = _load_vectorstore()
results = vectorstore.similarity_search(query, k=k)
if not results:
return "No relevant documentation found for this query."
formatted = []
for i, doc in enumerate(results, 1):
source = doc.metadata.get("source", "unknown")
formatted.append(f"[{i}] Source: {source}\n{doc.page_content.strip()}")
return "\n\n---\n\n".join(formatted)
What lru_cache does here: The FAISS index is loaded from disk on the first call and cached in memory for all subsequent calls. Without this, every tool invocation would reload the index from disk — expensive for a large index.
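You can see the caching behavior in isolation with a stand-in loader (the string return value here is a placeholder for the real FAISS object):

```python
from functools import lru_cache

loads = []

@lru_cache(maxsize=1)
def load_index(path: str) -> str:
    loads.append(path)               # record each real "disk load"
    return f"<index loaded from {path}>"

load_index("faiss_index")
load_index("faiss_index")            # cache hit: the function body never runs again
print(len(loads))                    # → 1
```

Note the cache is keyed by arguments; since `_load_vectorstore()` takes none, `maxsize=1` is all it needs.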
Step 4 — Implement generate_test_cases
# agent/tools/testgen.py
import os
from pydantic import BaseModel, Field
from langchain_core.tools import tool
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_openai import ChatOpenAI
from .retrieval import search_docs
TESTGEN_SYSTEM_PROMPT = """You are a senior QA engineer. Given a feature description and relevant
documentation excerpts, generate a concise list of test cases.
Format each test case as:
TC-XXX: [Test Case Title]
Given: [precondition]
When: [action]
Then: [expected result]
Focus on: happy path, edge cases, error cases, and boundary conditions."""
class GenerateTestCasesInput(BaseModel):
feature: str = Field(
description="The feature or component to generate test cases for (e.g., 'payment API', 'login flow')"
)
@tool("generate_test_cases", args_schema=GenerateTestCasesInput)
def generate_test_cases(feature: str) -> str:
"""Generate test cases for a feature using retrieved project documentation as context.
Automatically retrieves relevant documentation chunks via search_docs,
then uses an LLM to generate grounded test cases.
"""
# First, retrieve relevant context from the documentation
context = search_docs.invoke({"query": feature, "k": 4})
llm = ChatOpenAI(
model="gpt-4o-mini",
temperature=0.2,
api_key=os.environ.get("OPENAI_API_KEY"),
)
messages = [
SystemMessage(content=TESTGEN_SYSTEM_PROMPT),
HumanMessage(content=f"Feature: {feature}\n\nDocumentation context:\n{context}"),
]
response = llm.invoke(messages)
return response.content
Why call search_docs inside generate_test_cases? This is a tool composition pattern: generate_test_cases calls search_docs internally to retrieve context before asking the LLM to generate. The agent doesn’t need to chain these calls explicitly — it just calls generate_test_cases and gets fully-contextualized results back.
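The composition pattern in miniature, with plain functions standing in for the LangChain tools (both names here are hypothetical stand-ins): the outer tool owns the retrieval step, so the caller never has to chain calls itself.

```python
def fake_search_docs(query: str, k: int = 3) -> str:
    """Stand-in for the FAISS-backed retrieval tool."""
    return f"[1] Source: requirements.md\nExcerpt relevant to: {query}"

def fake_generate_test_cases(feature: str) -> str:
    """Retrieval happens inside; the caller supplies only the feature."""
    context = fake_search_docs(feature, k=4)
    return f"TC-001: test for {feature}\nGrounded in:\n{context}"

result = fake_generate_test_cases("login flow")
print("requirements.md" in result)  # → True
```

The tradeoff: the agent loses the option to retrieve with a different query than the feature name. If you want that flexibility, keep the tools separate and let the model chain them.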
Step 5 — Wire the Tools Package
# agent/tools/__init__.py
from .retrieval import search_docs
from .testgen import generate_test_cases
TOOLS = [search_docs, generate_test_cases]
The import in nodes.py needs no change: `from .tools import TOOLS` now resolves to the package's __init__.py instead of the old single-file tools.py.
# agent/nodes.py (import unchanged, now resolves to the package)
from .tools import TOOLS
graph.py likewise keeps its existing import, now served by the package:
# agent/graph.py (changed import only)
from .tools import TOOLS
No changes needed to state.py, graph.py logic, or main.py — the agent loop is identical. Only the tool implementations change.
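Why does nothing else change? The tool node resolves tool calls by name, so as long as tool names and signatures stay the same, swapping a stub body for a real implementation is invisible to the loop. A stdlib-only sketch of that dispatch idea (function names here are illustrative, not LangGraph API):

```python
def stub_search_docs(query: str) -> str:
    return "placeholder excerpt"                 # Ch 4 behavior

def real_search_docs(query: str) -> str:
    return f"FAISS excerpt about: {query}"       # Ch 5 behavior

# The loop looks tools up in a name -> callable registry,
# so replacing the callable under the same name changes nothing upstream.
registry = {"search_docs": stub_search_docs}
registry["search_docs"] = real_search_docs       # swap: same name, new body

print(registry["search_docs"]("login"))          # → "FAISS excerpt about: login"
```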
Running the Agent
With the FAISS index ingested, start the agent:
# agent/main.py (unchanged from Ch 4)
from dotenv import load_dotenv
from langchain_core.messages import HumanMessage
from .graph import build_graph
load_dotenv()
def chat(app, thread_id: str, user_input: str) -> None:
config = {"configurable": {"thread_id": thread_id}}
inputs = {"messages": [HumanMessage(content=user_input)]}
print(f"\nUser: {user_input}")
print("Assistant: ", end="", flush=True)
for chunk, metadata in app.stream(inputs, config=config, stream_mode="messages"):
if (
chunk.content
and metadata.get("langgraph_node") == "agent"
and not getattr(chunk, "tool_calls", None)
):
print(chunk.content, end="", flush=True)
print()
def main():
app = build_graph()
thread = "qa-demo"
chat(app, thread, "What are the login requirements?")
chat(app, thread, "Generate test cases for the login flow")
chat(app, thread, "What does the payment API return when a payment is declined?")
chat(app, thread, "Generate edge case tests for the payment API")
if __name__ == "__main__":
main()
Expected output (abbreviated):
User: What are the login requirements?
Assistant: Based on the project documentation, here are the login requirements:
- Users authenticate via email and password
- Passwords must be at least 12 characters with uppercase, lowercase, digit, and special character
- Accounts lock after 5 failed attempts (15-minute lockout)
- MFA is required for admin accounts
- Sessions expire after 8 hours of inactivity
User: Generate test cases for the login flow
Assistant: Here are test cases for the login flow:
TC-001: Successful Login with Valid Credentials
Given: A registered user with a valid email and compliant password
When: User submits correct credentials
Then: Session token is issued and user is redirected to dashboard
TC-002: Account Lockout After 5 Failed Attempts
Given: A registered user account
When: User enters incorrect password 5 consecutive times
Then: Account is locked for 15 minutes and lockout message is displayed
TC-003: Password Complexity Validation
Given: A new user attempting to set a password
When: Password is submitted without a special character
Then: Error message shown listing all complexity requirements
TC-004: MFA Required for Admin Accounts
Given: An admin-role user
When: User authenticates with email/password
Then: MFA challenge is presented before access is granted
TC-005: Session Expiry After Inactivity
Given: A logged-in user session
When: User is inactive for 8+ hours
Then: Session is invalidated and user is redirected to login
User: What does the payment API return when a payment is declined?
Assistant: When a payment is declined, the payment API returns:
- Status: 402 Payment Required
- This indicates the payment method was declined by the payment processor
Key Design Points
| Decision | Rationale |
|---|---|
| `lru_cache` on FAISS load | Avoids reloading the index on every tool call in a single process |
| `chunk_size=500`, `overlap=100` | Balances retrieval precision vs. context completeness |
| `text-embedding-3-small` | 1536-dim embeddings, strong quality/cost tradeoff for English docs |
| Tool composition in `generate_test_cases` | Agent calls one tool; retrieval happens internally |
| Separate `ingest_docs.py` script | Ingestion is a one-time offline step, not part of the agent loop |
Summary
| Component | What Changed from Ch 4 |
|---|---|
| `agent/tools/retrieval.py` | `search_docs` now does real FAISS similarity search |
| `agent/tools/testgen.py` | `generate_test_cases` retrieves context then calls LLM |
| `agent/tools/__init__.py` | New package entry point |
| `scripts/ingest_docs.py` | New: offline doc ingestion script |
| `docs/*.md` | New: project documentation files |
| `agent/nodes.py`, `graph.py`, `main.py` | Unchanged — agent loop is identical |
In the next chapter, we add production safety: user context injection via AgentContext, content moderation with NeMo Guardrails, and human-in-the-loop interrupts for high-stakes decisions.
← Ch 4: Build Your First Agent | Ch 6: Tools, Guardrails & Safety →
