[Ch 2] AI Agent Components & Context Engineering
Knowing that an agent has a “loop” is not enough to build a good one. You need to understand what the system is made of — and exactly what the LLM sees at every step. This chapter covers both.
The Four Core Components
Every AI agent system, regardless of framework, is built from four components:
1. LLM Backbone
The language model is the brain. It decides what to do next: whether to call a tool, what arguments to pass, and when to stop. The quality, speed, and cost of your agent depend heavily on which model you choose.
Key decisions:
- Model choice: GPT-4o for complex reasoning; GPT-4o-mini for simpler, high-throughput tasks
- Temperature: Keep it low (0–0.3) for tool-calling agents — you want consistent, repeatable decisions, not creative variance
- Context window size: Determines how much history and retrieved content the model can “see”
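These knobs can be made explicit in code. Below is a minimal routing sketch; the task labels, model tiers, and thresholds are illustrative choices, not part of any SDK:

```python
# Hypothetical config helper: route heavy-reasoning tasks to the stronger
# model and keep temperature low for tool-calling. Task names are
# illustrative, not a prescribed API.
from dataclasses import dataclass

@dataclass(frozen=True)
class LLMConfig:
    model: str
    temperature: float

def config_for(task: str) -> LLMConfig:
    """Pick a backbone per task type: complex reasoning vs. high throughput."""
    if task in {"planning", "analysis"}:
        return LLMConfig(model="gpt-4o", temperature=0.1)
    # Simple, high-volume work goes to the cheaper, faster model.
    return LLMConfig(model="gpt-4o-mini", temperature=0.0)

print(config_for("planning"))
print(config_for("routing"))
```

Centralizing this choice in one function also makes it trivial to A/B a cheaper model later without touching agent logic.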
2. Tools
Tools are functions the LLM can call to interact with the outside world. Common categories:
| Category | Examples |
|---|---|
| Information retrieval | Web search, vector store lookup, database query |
| Action execution | File write, API call, form submission |
| Computation | Code execution, math evaluation, data parsing |
| Communication | Send email, post to Slack, create a ticket |
A tool is just a function with a clear name, description, and typed parameters — the model reads the description to know when and how to use it.
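Concretely, that pairing looks like the sketch below. The schema follows the OpenAI function-calling format; the function body is a stub for illustration:

```python
# A tool = a plain function plus a machine-readable schema. The model
# reads "description" to decide when and how to call it. The search
# logic is stubbed here for illustration.
def search_documents(query: str, max_results: int = 5) -> list[str]:
    """Search the document knowledge base (stubbed)."""
    return [f"result for {query!r}"][:max_results]

SEARCH_DOCUMENTS_SCHEMA = {
    "type": "function",
    "function": {
        "name": "search_documents",
        "description": "Search through the document knowledge base for relevant passages.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "The search query to find relevant document sections",
                },
                "max_results": {
                    "type": "integer",
                    "description": "Maximum number of results to return (default: 5)",
                },
            },
            "required": ["query"],
        },
    },
}
```

Note that everything in the schema — name, description, parameter docs — is text the model will read on every call, which is exactly why tool schemas cost tokens.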
3. Memory
Memory is how the agent maintains context across steps and sessions. There are four distinct memory types (covered in depth in Ch 6):
- In-context: The current conversation history in the message list
- Episodic: Persisted conversation state (checkpointer / DB)
- Semantic: Long-term knowledge retrieved via RAG (vector store)
- Procedural: The system prompt — “how to behave” instructions baked into every request
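A sketch of how the four types surface in a single call's message list; the helper name and data are illustrative, and persistence/retrieval are stubbed:

```python
# Procedural memory = system prompt; episodic = persisted prior turns;
# semantic = retrieved chunks; in-context = the assembled list itself.
# (Illustrative helper, not a library API.)
def build_messages(system_prompt, episodic_history, semantic_chunks, user_query):
    context_block = "\n".join(semantic_chunks)
    return (
        [{"role": "system", "content": system_prompt}]
        + episodic_history
        + [{"role": "user",
            "content": f"Context:\n{context_block}\n\nQuestion: {user_query}"}]
    )

msgs = build_messages(
    "You are a document analysis assistant.",
    [{"role": "user", "content": "Hi"},
     {"role": "assistant", "content": "Hello! How can I help?"}],
    ["[Section 4.2]: Payment shall be made within 30 days of invoice..."],
    "What are the payment terms?",
)
```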
4. Planning
Planning is the agent’s ability to decompose a complex goal into steps. This ranges from implicit (the LLM decides step-by-step via ReAct) to explicit (a dedicated “planner” LLM call generates a structured plan that an “executor” then follows).
For most enterprise agents, the implicit ReAct planning is sufficient — and simpler to debug. Explicit multi-step planners are powerful but add complexity; reach for them only when the task requires long-horizon coordination across many sub-tasks.
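To make the planner/executor distinction concrete, here is its control flow with the LLM calls stubbed out; treat it as a shape, not an implementation:

```python
# Explicit planning sketch: a "planner" call produces a structured plan,
# an "executor" loop works through it. Both LLM calls are stubbed.
def plan(goal: str) -> list[str]:
    """Stand-in for a dedicated planner LLM call returning ordered steps."""
    return [f"research {goal}", f"draft answer for {goal}", "review answer"]

def execute(step: str) -> str:
    """Stand-in for the executor handling one step (tool calls, etc.)."""
    return f"done: {step}"

results = [execute(step) for step in plan("contract summary")]
```

In implicit (ReAct) planning there is no `plan()` call at all: the same LLM decides the next step inside the loop, which is why it is simpler to debug.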
Prompt Engineering vs. Context Engineering
Here’s a distinction that will save you weeks of frustration in production.
Prompt Engineering is about what words you use — how to phrase instructions to get the model to behave the way you want. It’s the craft of writing good system prompts, user messages, and few-shot examples.
Context Engineering is about what the model actually sees — managing everything that occupies the context window at each step of the agent loop.
Context Engineering is a superset of Prompt Engineering.
In a single LLM API call, you have to write a good prompt. In an agent system, you have to do that and manage every token in the context window across potentially dozens of steps.
Anatomy of a Context Window
Every time the LLM is called, it receives a context window. In an agent loop, that window contains multiple layers: the system prompt, the tool schemas, the conversation history so far, any retrieved documents, and the current user message.
The crucial insight: tools eat tokens even when they’re not used.
A typical tool schema (name + description + parameter definitions) costs 100–300 tokens. If your agent has 15 tools, that’s 1,500–4,500 tokens consumed before any conversation history or retrieved documents. For an agent with a 32K token budget running a long conversation, this becomes a real constraint.
Token Budget: A Practical Example
Let’s measure the real cost of each context component:
```python
# token_budget.py
import tiktoken

# GPT-4o uses the o200k_base encoding
ENCODING = tiktoken.get_encoding("o200k_base")

def count_tokens(text: str) -> int:
    return len(ENCODING.encode(text))

# --- Simulate context components ---
SYSTEM_PROMPT = """You are an expert document analyst assistant. You help users
understand and extract information from complex technical documents. You are precise,
thorough, and always cite your sources. When a document is provided, read it carefully
before answering. Format your responses in clear markdown."""

TOOL_SCHEMA_EXAMPLE = """Tool: search_documents
Description: Search through the document knowledge base for relevant passages.
Parameters:
- query (string, required): The search query to find relevant document sections
- max_results (integer, optional): Maximum number of results to return (default: 5)
- document_filter (string, optional): Filter results to a specific document by name"""

# Simulate 10 back-and-forth turns
CONVERSATION_HISTORY = """
User: Can you find all mentions of the payment terms in the contract?
Assistant: [calls search_documents(query="payment terms")]
Tool Result: Found 3 sections mentioning payment terms: Section 4.2, 7.1, and Appendix B.
Assistant: I found payment terms in three locations. Section 4.2 covers...
[... 9 more turns ...]
""" * 3  # multiply to simulate a longer conversation

RETRIEVED_DOCS = """
[Retrieved chunk 1 - Section 4.2]: Payment shall be made within 30 days of invoice...
[Retrieved chunk 2 - Section 7.1]: Late payment penalties apply at 1.5% per month...
[Retrieved chunk 3 - Appendix B]: Extended payment plans available upon request...
""" * 5  # 5 retrieved chunks

CURRENT_QUERY = "What is the penalty for paying 60 days late?"

# --- Count and report ---
components = {
    "System Prompt": SYSTEM_PROMPT,
    "Tool Schema (x10 tools estimate)": TOOL_SCHEMA_EXAMPLE * 10,
    "Conversation History (10 turns)": CONVERSATION_HISTORY,
    "Retrieved Documents (5 chunks)": RETRIEVED_DOCS,
    "Current User Query": CURRENT_QUERY,
}

total = 0
print(f"{'Component':<40} {'Tokens':>8} {'% of 32K':>10}")
print("-" * 62)
for name, content in components.items():
    tokens = count_tokens(content)
    total += tokens
    print(f"{name:<40} {tokens:>8,} {tokens/32768*100:>9.1f}%")
print("-" * 62)
print(f"{'TOTAL':<40} {total:>8,} {total/32768*100:>9.1f}%")
print(f"\nRemaining budget (32K model): {max(0, 32768 - total):,} tokens")
print(f"Remaining budget (128K model): {max(0, 131072 - total):,} tokens")
```
Running this produces output like:
```text
Component                                  Tokens   % of 32K
--------------------------------------------------------------
System Prompt                                  52       0.2%
Tool Schema (x10 tools estimate)            1,430       4.4%
Conversation History (10 turns)             1,710       5.2%
Retrieved Documents (5 chunks)                975       3.0%
Current User Query                             16       0.0%
--------------------------------------------------------------
TOTAL                                       4,183      12.8%

Remaining budget (32K model): 28,585 tokens
Remaining budget (128K model): 126,889 tokens
```
Ten turns into a conversation with a moderate number of tools and RAG chunks, you’ve used only 13% of a 32K window — but that grows fast. At 30+ turns with larger documents, you’ll hit limits and need to trim.
Context Engineering Strategies
Once you understand what’s in the context window, you can manage it deliberately:
1. System Prompt — Keep It Focused
The system prompt runs on every single call. Every unnecessary sentence costs tokens across your entire agent’s lifetime. Write precisely:
```python
# ❌ Vague, padded
BAD_SYSTEM_PROMPT = """
You are a very helpful and friendly AI assistant. You are smart and knowledgeable
about many things. You always try to be nice and helpful to users. When users ask
questions, you do your best to answer them accurately and helpfully...
"""

# ✅ Precise, structured
GOOD_SYSTEM_PROMPT = """You are a document analysis assistant.

CAPABILITIES:
- Search documents using the search_documents tool
- Extract and cite specific passages
- Summarize findings in structured markdown

CONSTRAINTS:
- Only answer questions about documents in the knowledge base
- Always cite the source section when quoting
- If information is not found, say so explicitly
"""
```
2. Tool Schemas — Prune Unused Tools
Don’t pass every tool on every call. If your agent has a “create_document” tool but the user just asked a read-only question, remove it from the current call. In LangGraph, you can swap the tool list dynamically per node.
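A minimal sketch of per-call tool pruning using a crude keyword heuristic; the tool names and the heuristic are illustrative, and a production router would classify intent more robustly:

```python
# Per-call tool pruning sketch: drop mutating tools when the query looks
# read-only. Tool registry and keyword heuristic are illustrative.
ALL_TOOLS = {
    "search_documents": {"mutates": False},
    "create_document": {"mutates": True},
    "send_email": {"mutates": True},
}

READ_ONLY_HINTS = ("what", "find", "show", "list", "summarize")

def tools_for(query: str) -> list[str]:
    """Return the tool names to expose for this call."""
    read_only = query.lower().startswith(READ_ONLY_HINTS)
    return [name for name, meta in ALL_TOOLS.items()
            if not (read_only and meta["mutates"])]

print(tools_for("What are the payment terms?"))  # mutating tools pruned
```

Every pruned tool saves its schema's 100–300 tokens on that call, and also removes a wrong-tool option the model could have picked.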
3. Conversation History — Trim Aggressively
Keep a sliding window of the last N turns, or summarize old turns into a single “context block.” We’ll implement this in Ch 6.
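The sliding-window variant is a few lines, assuming OpenAI-style message dicts:

```python
# Sliding-window trim sketch: always keep the system message, plus only
# the last N non-system messages. Summarizing old turns instead is
# covered in Ch 6.
def trim_history(messages: list[dict], keep_last: int = 6) -> list[dict]:
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-keep_last:]

history = [{"role": "system", "content": "You are a document analyst."}] + [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"turn {i}"}
    for i in range(20)
]
trimmed = trim_history(history)
```

One caveat worth knowing: trimming mid-turn can orphan a tool call from its tool result, so real implementations trim on turn boundaries.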
4. RAG Chunks — Quality Over Quantity
Don’t inject 20 retrieved chunks “just in case.” Use a relevance threshold: only inject chunks above a similarity score of 0.75. Three highly relevant chunks beat ten mediocre ones — and cost far less context.
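That threshold rule is only a few lines. A sketch, assuming each chunk arrives paired with a cosine-similarity score:

```python
# Relevance-threshold sketch: keep only chunks scoring at or above the
# cutoff, then cap at a small k. Scores and chunk text are illustrative.
def select_chunks(scored_chunks, threshold: float = 0.75, k: int = 3):
    relevant = [(chunk, score) for chunk, score in scored_chunks
                if score >= threshold]
    relevant.sort(key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in relevant[:k]]

scored = [
    ("Section 4.2: Payment within 30 days...", 0.91),
    ("Section 7.1: Late penalties at 1.5%/month...", 0.83),
    ("Appendix B: Extended payment plans...", 0.78),
    ("Boilerplate: definitions of terms...", 0.41),
]
selected = select_chunks(scored)  # the low-scoring chunk is dropped
```

The right threshold depends on your embedding model and corpus; 0.75 is a starting point to tune, not a universal constant.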
The Mental Model
Think of the context window as a whiteboard that the LLM reads before every response. You control everything on that whiteboard:
- What instructions appear (system prompt)
- What tools are available (tool schemas)
- What happened before (conversation history)
- What facts are relevant (retrieved documents)
- What the user just asked (current message)
Context Engineering is the discipline of making every token on that whiteboard count.
.env.example

```
# .env.example
OPENAI_API_KEY=your-api-key-here
```
💡 Ollama note: The token counting above uses `tiktoken` (OpenAI’s tokenizer). Ollama models use different tokenizers (Llama uses `sentencepiece`). Token counts will vary, but the principles — keep system prompts tight, prune unused tools, trim history — apply universally.
Summary
| Concept | Key Takeaway |
|---|---|
| LLM Backbone | The decision-maker; choose model + temperature carefully |
| Tools | Functions the LLM can call; each schema costs tokens |
| Memory | Four types: in-context, episodic, semantic, procedural |
| Planning | Implicit (ReAct) vs. explicit; start with implicit |
| Prompt Engineering | Craft of writing good instructions and examples |
| Context Engineering | Managing everything in the context window across all steps |
| Token Budget | System prompt + tool schemas + history + RAG + query = limited budget |
In the next chapter, we get hands-on with LangChain and LangGraph — the frameworks that automate the agent loop we’ve been building mentally.
← Ch 1: Introduction to AI Agents | Ch 3: LangChain & LangGraph Intro →
