[Ch 2] AI Agent Components & Context Engineering
Knowing that an agent has a “loop” is not enough to build a good one. You need to understand what the system is made of — and exactly what the LLM sees at every step. This chapter covers both.
The Four Core Components
Every AI agent system, regardless of framework, is built from four components:
1. LLM Backbone
The language model is the brain. It decides what to do next: whether to call a tool, what arguments to pass, and when to stop. The quality, speed, and cost of your agent depend heavily on which model you choose.
Key decisions:
- Model choice: GPT-4o for complex reasoning; GPT-4o-mini for simpler, high-throughput tasks
- Temperature: Keep it low (0–0.3) for tool-calling agents — you want consistent, repeatable decisions, not creative variance
- Context window size: Determines how much history and retrieved content the model can “see”
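These knobs can be made explicit in code. Below is a minimal routing sketch; the task labels, model tiers, and thresholds are illustrative choices, not part of any SDK:

```python
# Hypothetical config helper: route heavy-reasoning tasks to the stronger
# model and keep temperature low for tool-calling. Task names are
# illustrative, not a prescribed API.
from dataclasses import dataclass

@dataclass(frozen=True)
class LLMConfig:
    model: str
    temperature: float

def config_for(task: str) -> LLMConfig:
    """Pick a backbone per task type: complex reasoning vs. high throughput."""
    if task in {"planning", "analysis"}:
        return LLMConfig(model="gpt-4o", temperature=0.1)
    # Simple, high-volume work goes to the cheaper, faster model.
    return LLMConfig(model="gpt-4o-mini", temperature=0.0)

print(config_for("planning"))
print(config_for("routing"))
```

Centralizing this choice in one function also makes it trivial to A/B a cheaper model later without touching agent logic.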
2. Tools
Tools are functions the LLM can call to interact with the outside world. Common categories:
| Category | Examples |
|---|---|
| Information retrieval | Web search, vector store lookup, database query |
| Action execution | File write, API call, form submission |
| Computation | Code execution, math evaluation, data parsing |
| Communication | Send email, post to Slack, create a ticket |
A tool is just a function with a clear name, description, and typed parameters — the model reads the description to know when and how to use it.
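Concretely, that pairing looks like the sketch below. The schema follows the OpenAI function-calling format; the function body is a stub for illustration:

```python
# A tool = a plain function plus a machine-readable schema. The model
# reads "description" to decide when and how to call it. The search
# logic is stubbed here for illustration.
def search_documents(query: str, max_results: int = 5) -> list[str]:
    """Search the document knowledge base (stubbed)."""
    return [f"result for {query!r}"][:max_results]

SEARCH_DOCUMENTS_SCHEMA = {
    "type": "function",
    "function": {
        "name": "search_documents",
        "description": "Search through the document knowledge base for relevant passages.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "The search query to find relevant document sections",
                },
                "max_results": {
                    "type": "integer",
                    "description": "Maximum number of results to return (default: 5)",
                },
            },
            "required": ["query"],
        },
    },
}
```

Note that everything in the schema — name, description, parameter docs — is text the model will read on every call, which is exactly why tool schemas cost tokens.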
3. Memory
Memory is how the agent maintains context across steps and sessions. There are four distinct memory types (covered in depth in Ch 6):
- In-context: The current conversation history in the message list
- Episodic: Persisted conversation state (checkpointer / DB)
- Semantic: Long-term knowledge retrieved via RAG (vector store)
- Procedural: The system prompt — “how to behave” instructions baked into every request
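A sketch of how the four types surface in a single call's message list; the helper name and data are illustrative, and persistence/retrieval are stubbed:

```python
# Procedural memory = system prompt; episodic = persisted prior turns;
# semantic = retrieved chunks; in-context = the assembled list itself.
# (Illustrative helper, not a library API.)
def build_messages(system_prompt, episodic_history, semantic_chunks, user_query):
    context_block = "\n".join(semantic_chunks)
    return (
        [{"role": "system", "content": system_prompt}]
        + episodic_history
        + [{"role": "user",
            "content": f"Context:\n{context_block}\n\nQuestion: {user_query}"}]
    )

msgs = build_messages(
    "You are a document analysis assistant.",
    [{"role": "user", "content": "Hi"},
     {"role": "assistant", "content": "Hello! How can I help?"}],
    ["[Section 4.2]: Payment shall be made within 30 days of invoice..."],
    "What are the payment terms?",
)
```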
4. Planning
Planning is the agent’s ability to decompose a complex goal into steps. This ranges from implicit (the LLM decides step-by-step via ReAct) to explicit (a dedicated “planner” LLM call generates a structured plan that an “executor” then follows).
For most enterprise agents, the implicit ReAct planning is sufficient — and simpler to debug. Explicit multi-step planners are powerful but add complexity; reach for them only when the task requires long-horizon coordination across many sub-tasks.
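To make the planner/executor distinction concrete, here is its control flow with the LLM calls stubbed out; treat it as a shape, not an implementation:

```python
# Explicit planning sketch: a "planner" call produces a structured plan,
# an "executor" loop works through it. Both LLM calls are stubbed.
def plan(goal: str) -> list[str]:
    """Stand-in for a dedicated planner LLM call returning ordered steps."""
    return [f"research {goal}", f"draft answer for {goal}", "review answer"]

def execute(step: str) -> str:
    """Stand-in for the executor handling one step (tool calls, etc.)."""
    return f"done: {step}"

results = [execute(step) for step in plan("contract summary")]
```

In implicit (ReAct) planning there is no `plan()` call at all: the same LLM decides the next step inside the loop, which is why it is simpler to debug.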
Prompt Engineering vs. Context Engineering
Here’s a distinction that will save you weeks of frustration in production.
Prompt Engineering is about what words you use — how to phrase instructions to get the model to behave the way you want. It’s the craft of writing good system prompts, user messages, and few-shot examples.
Context Engineering is about what the model actually sees — managing everything that occupies the context window at each step of the agent loop.
Context Engineering is a superset of Prompt Engineering.
In a single LLM API call, you have to write a good prompt. In an agent system, you have to do that and manage every token in the context window across potentially dozens of steps.
Anatomy of a Context Window
Every time the LLM is called, it receives a context window. In an agent loop, that window contains multiple layers: the system prompt, the tool schemas, the conversation history so far, any retrieved documents, and the current user message.
The crucial insight: tools eat tokens even when they’re not used.
A typical tool schema (name + description + parameter definitions) costs 100–300 tokens. If your agent has 15 tools, that’s 1,500–4,500 tokens consumed before any conversation history or retrieved documents. For an agent with a 32K token budget running a long conversation, this becomes a real constraint.
Token Budget: A Practical Example
Let’s measure the real cost of each context component:
```python
# token_budget.py
import tiktoken

# GPT-4o uses the o200k_base encoding
ENCODING = tiktoken.get_encoding("o200k_base")

def count_tokens(text: str) -> int:
    return len(ENCODING.encode(text))

# --- Simulate context components ---
SYSTEM_PROMPT = """You are an expert document analyst assistant. You help users
understand and extract information from complex technical documents. You are precise,
thorough, and always cite your sources. When a document is provided, read it carefully
before answering. Format your responses in clear markdown."""

TOOL_SCHEMA_EXAMPLE = """Tool: search_documents
Description: Search through the document knowledge base for relevant passages.
Parameters:
- query (string, required): The search query to find relevant document sections
- max_results (integer, optional): Maximum number of results to return (default: 5)
- document_filter (string, optional): Filter results to a specific document by name"""

# Simulate 10 back-and-forth turns
CONVERSATION_HISTORY = """
User: Can you find all mentions of the payment terms in the contract?
Assistant: [calls search_documents(query="payment terms")]
Tool Result: Found 3 sections mentioning payment terms: Section 4.2, 7.1, and Appendix B.
Assistant: I found payment terms in three locations. Section 4.2 covers...
[... 9 more turns ...]
""" * 3  # multiply to simulate a longer conversation

RETRIEVED_DOCS = """
[Retrieved chunk 1 - Section 4.2]: Payment shall be made within 30 days of invoice...
[Retrieved chunk 2 - Section 7.1]: Late payment penalties apply at 1.5% per month...
[Retrieved chunk 3 - Appendix B]: Extended payment plans available upon request...
""" * 5  # 5 retrieved chunks

CURRENT_QUERY = "What is the penalty for paying 60 days late?"

# --- Count and report ---
components = {
    "System Prompt": SYSTEM_PROMPT,
    "Tool Schema (x10 tools estimate)": TOOL_SCHEMA_EXAMPLE * 10,
    "Conversation History (10 turns)": CONVERSATION_HISTORY,
    "Retrieved Documents (5 chunks)": RETRIEVED_DOCS,
    "Current User Query": CURRENT_QUERY,
}

total = 0
print(f"{'Component':<40} {'Tokens':>8} {'% of 32K':>10}")
print("-" * 62)
for name, content in components.items():
    tokens = count_tokens(content)
    total += tokens
    print(f"{name:<40} {tokens:>8,} {tokens/32768*100:>9.1f}%")
print("-" * 62)
print(f"{'TOTAL':<40} {total:>8,} {total/32768*100:>9.1f}%")
print(f"\nRemaining budget (32K model): {max(0, 32768 - total):,} tokens")
print(f"Remaining budget (128K model): {max(0, 131072 - total):,} tokens")
```
Running this produces output like:
```text
Component                                  Tokens   % of 32K
--------------------------------------------------------------
System Prompt                                  52       0.2%
Tool Schema (x10 tools estimate)            1,430       4.4%
Conversation History (10 turns)             1,710       5.2%
Retrieved Documents (5 chunks)                975       3.0%
Current User Query                             16       0.0%
--------------------------------------------------------------
TOTAL                                       4,183      12.8%

Remaining budget (32K model): 28,585 tokens
Remaining budget (128K model): 126,889 tokens
```
Ten turns into a conversation with a moderate number of tools and RAG chunks, you’ve used only 13% of a 32K window — but that grows fast. At 30+ turns with larger documents, you’ll hit limits and need to trim.
Context Engineering Strategies
Once you understand what’s in the context window, you can manage it deliberately:
1. System Prompt — Keep It Focused
The system prompt runs on every single call. Every unnecessary sentence costs tokens across your entire agent’s lifetime. Write precisely:
```python
# ❌ Vague, padded
BAD_SYSTEM_PROMPT = """
You are a very helpful and friendly AI assistant. You are smart and knowledgeable
about many things. You always try to be nice and helpful to users. When users ask
questions, you do your best to answer them accurately and helpfully...
"""

# ✅ Precise, structured
GOOD_SYSTEM_PROMPT = """You are a document analysis assistant.

CAPABILITIES:
- Search documents using the search_documents tool
- Extract and cite specific passages
- Summarize findings in structured markdown

CONSTRAINTS:
- Only answer questions about documents in the knowledge base
- Always cite the source section when quoting
- If information is not found, say so explicitly
"""
```
2. Tool Schemas — Prune Unused Tools
Don’t pass every tool on every call. If your agent has a “create_document” tool but the user just asked a read-only question, remove it from the current call. In LangGraph, you can swap the tool list dynamically per node.
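A minimal sketch of per-call tool pruning using a crude keyword heuristic; the tool names and the heuristic are illustrative, and a production router would classify intent more robustly:

```python
# Per-call tool pruning sketch: drop mutating tools when the query looks
# read-only. Tool registry and keyword heuristic are illustrative.
ALL_TOOLS = {
    "search_documents": {"mutates": False},
    "create_document": {"mutates": True},
    "send_email": {"mutates": True},
}

READ_ONLY_HINTS = ("what", "find", "show", "list", "summarize")

def tools_for(query: str) -> list[str]:
    """Return the tool names to expose for this call."""
    read_only = query.lower().startswith(READ_ONLY_HINTS)
    return [name for name, meta in ALL_TOOLS.items()
            if not (read_only and meta["mutates"])]

print(tools_for("What are the payment terms?"))  # mutating tools pruned
```

Every pruned tool saves its schema's 100–300 tokens on that call, and also removes a wrong-tool option the model could have picked.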
3. Conversation History — Trim Aggressively
Keep a sliding window of the last N turns, or summarize old turns into a single “context block.” We’ll implement this in Ch 6.
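The sliding-window variant is a few lines, assuming OpenAI-style message dicts:

```python
# Sliding-window trim sketch: always keep the system message, plus only
# the last N non-system messages. Summarizing old turns instead is
# covered in Ch 6.
def trim_history(messages: list[dict], keep_last: int = 6) -> list[dict]:
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-keep_last:]

history = [{"role": "system", "content": "You are a document analyst."}] + [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"turn {i}"}
    for i in range(20)
]
trimmed = trim_history(history)
```

One caveat worth knowing: trimming mid-turn can orphan a tool call from its tool result, so real implementations trim on turn boundaries.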
4. RAG Chunks — Quality Over Quantity
Don’t inject 20 retrieved chunks “just in case.” Use a relevance threshold: only inject chunks above a similarity score of 0.75. Three highly relevant chunks beat ten mediocre ones — and cost far less context.
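That threshold rule is only a few lines. A sketch, assuming each chunk arrives paired with a cosine-similarity score:

```python
# Relevance-threshold sketch: keep only chunks scoring at or above the
# cutoff, then cap at a small k. Scores and chunk text are illustrative.
def select_chunks(scored_chunks, threshold: float = 0.75, k: int = 3):
    relevant = [(chunk, score) for chunk, score in scored_chunks
                if score >= threshold]
    relevant.sort(key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in relevant[:k]]

scored = [
    ("Section 4.2: Payment within 30 days...", 0.91),
    ("Section 7.1: Late penalties at 1.5%/month...", 0.83),
    ("Appendix B: Extended payment plans...", 0.78),
    ("Boilerplate: definitions of terms...", 0.41),
]
selected = select_chunks(scored)  # the low-scoring chunk is dropped
```

The right threshold depends on your embedding model and corpus; 0.75 is a starting point to tune, not a universal constant.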
The Mental Model
Think of the context window as a whiteboard that the LLM reads before every response. You control everything on that whiteboard:
- What instructions appear (system prompt)
- What tools are available (tool schemas)
- What happened before (conversation history)
- What facts are relevant (retrieved documents)
- What the user just asked (current message)
Context Engineering is the discipline of making every token on that whiteboard count.
.env.example

```
# .env.example
OPENAI_API_KEY=your-api-key-here
```
💡 Ollama note: The token counting above uses `tiktoken` (OpenAI’s tokenizer). Ollama models use different tokenizers (Llama uses `sentencepiece`). Token counts will vary, but the principles — keep system prompts tight, prune unused tools, trim history — apply universally.
Summary
| Concept | Key Takeaway |
|---|---|
| LLM Backbone | The decision-maker; choose model + temperature carefully |
| Tools | Functions the LLM can call; each schema costs tokens |
| Memory | Four types: in-context, episodic, semantic, procedural |
| Planning | Implicit (ReAct) vs. explicit; start with implicit |
| Prompt Engineering | Craft of writing good instructions and examples |
| Context Engineering | Managing everything in the context window across all steps |
| Token Budget | System prompt + tool schemas + history + RAG + query = limited budget |
In the next chapter, we get hands-on with LangChain and LangGraph — the frameworks that automate the agent loop we’ve been building mentally.
← Ch 1: Introduction to AI Agents | Ch 3: LangChain & LangGraph Intro →
