Software Letters

Building Agentic RAG Systems: Architecture, Reasoning Loops, and Production Considerations

The transition from simple LLM wrappers to AI Agents represents the next frontier in software engineering. While traditional Retrieval-Augmented Generation (RAG) improved LLM accuracy, Agentic RAG introduces a reasoning layer that allows the system to autonomously decide how to use data to solve a problem.

1. Positioning Your Expertise: The AI Pyramid

To understand where modern AI engineering sits today, it helps to visualize the ecosystem as a layered pyramid of capability. Each level represents a deeper degree of control over AI systems.

AI Literate

At the base of the pyramid are users who interact with AI tools but do not build them. These users leverage systems like ChatGPT, Claude, or Gemini to improve productivity. Their focus is prompt engineering and workflow integration rather than software architecture.

Typical activities include:

  • Writing prompts to summarize documents

  • Generating code snippets

  • Automating repetitive writing tasks

  • Using AI copilots inside IDEs or productivity tools

Although this layer requires minimal programming knowledge, prompt design still plays a role. Good prompts control tone, structure, and constraints, while poor prompts produce vague or hallucinated outputs. However, AI Literate users do not control:

  • Model behavior beyond prompting

  • Data pipelines

  • Tool integrations

  • Memory systems

They operate strictly at the interface level.

AI Enabled

This is the most important layer for modern software engineers. AI Enabled developers build applications on top of existing models using APIs, SDKs, and orchestration frameworks. They do not train models, but they design the systems around them.

Key responsibilities at this layer include:

  • Integrating LLM APIs

  • Designing agent workflows

  • Implementing retrieval systems

  • Managing prompts and tool use

  • Building evaluation and monitoring pipelines

Typical tools include:

  • OpenAI or Anthropic APIs

  • LangChain / LlamaIndex

  • Vector databases

  • Prompt templates

  • Observability platforms

The skillset required is closer to backend engineering than to machine learning research. Engineers must think about:

  • latency

  • scalability

  • token cost

  • prompt reliability

  • structured outputs

  • security of private data

This layer is where Agentic RAG systems are built.

AI Native

At the top of the pyramid are the engineers building the models themselves.

These researchers work on:

  • training large transformer models

  • reinforcement learning

  • model architecture design

  • fine-tuning pipelines

  • distributed GPU training

Their work involves tools such as:

  • PyTorch

  • CUDA

  • DeepSpeed

  • distributed training clusters

Tasks at this layer include:

  • training foundation models

  • optimizing inference

  • quantization

  • distillation

  • architectural innovation

Most software engineers will never need to operate at this level. The complexity and infrastructure costs are extremely high.

Instead, the majority of production AI systems will continue to be built by AI Enabled developers.

2. Defining the "Agent" Equation

An AI agent is not simply a chatbot interface.

It is a system that combines reasoning with the ability to take actions in an environment.

A useful way to conceptualize this is through a simple equation:

Agent = LLM + Actions + Context + Memory

Each component plays a distinct architectural role.
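The equation above can be sketched as a set of interfaces. This is a hypothetical shape, not any framework's actual API; the names `LLM`, `Tool`, `AgentMemory`, and `Agent` are illustrative:

```typescript
// Hypothetical interfaces for the agent equation:
// Agent = LLM + Actions + Context + Memory.
interface LLM {
  // the "brain": turns a prompt into a completion
  complete(prompt: string): Promise<string>;
}

interface Tool {
  // an action the agent can take in its environment
  name: string;
  run(input: string): Promise<string>;
}

interface AgentMemory {
  // persistence across interactions
  load(): string[];
  save(turn: string): void;
}

interface Agent {
  llm: LLM;
  tools: Tool[];
  context: string; // system prompt, rules, retrieved documents
  memory: AgentMemory;
}
```

Each interface maps onto one term of the equation, which makes the components individually swappable: a different model, a new tool, or a new memory backend can be substituted without touching the rest.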

LLM (The Brain)

The LLM is responsible for reasoning, language understanding, and decision-making.

Modern models such as GPT-4o, Claude Sonnet, or open-source models like Llama 3 act as probabilistic reasoning engines. They do not "know" facts in a deterministic way. Instead, they generate the most statistically likely sequence of tokens based on input context.

In an agent architecture, the LLM is responsible for:

  • interpreting user requests

  • deciding when to call tools

  • synthesizing responses

  • performing chain-of-thought reasoning

However, the LLM alone has severe limitations:

  • knowledge cutoff

  • hallucination risks

  • no access to private data

  • no persistent state

  • no ability to execute code

These limitations are solved by integrating the other components of the agent equation.

Actions (Tools)

Tools allow the agent to interact with external systems.

Without tools, an LLM can only generate text. With tools, it can interact with the world.

Examples of tools include:

  • API calls

  • database queries

  • code execution

  • web search

  • vector retrieval

  • file system operations

In LangChain, tools are typically implemented as structured functions.

Example conceptual tool definition:

search_docs(query: string): Document[]

The LLM decides when to call the tool and what input to provide.

Tools dramatically expand the capabilities of an agent. Instead of answering from internal model knowledge, the agent can retrieve real information. This transforms the system from a static chatbot into an interactive reasoning system.
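The mechanics of tool use can be shown with a small dispatch sketch. The `search_docs` stub and the `ToolCall` shape are hypothetical stand-ins for a real tool registry; the point is that the LLM emits a structured call and the runtime routes it:

```typescript
// The LLM's tool call names a tool and supplies its input;
// the runtime looks the tool up and executes it.
type ToolCall = { tool: string; input: string };

// Toy registry: search_docs stands in for a real vector search.
const tools: Record<string, (input: string) => string> = {
  search_docs: (query) => `results for "${query}"`,
};

function dispatch(call: ToolCall): string {
  const fn = tools[call.tool];
  if (!fn) throw new Error(`unknown tool: ${call.tool}`);
  return fn(call.input);
}
```

In production, frameworks like LangChain handle this routing for you, but the underlying pattern is the same: structured call in, tool result out.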

Context

Context defines the behavioral boundaries of the agent.

It usually consists of:

  • system prompts

  • instructions

  • role definitions

  • operational rules

Example instructions might include:

  • "You are a financial analyst."

  • "Only answer using the provided documents."

  • "If information is missing, say you do not know."

Context also includes:

  • retrieved documents

  • conversation history

  • tool outputs

LLMs operate within a limited context window. Therefore, context management becomes a critical engineering task.

Developers must decide:

  • which information to include

  • which information to remove

  • how to summarize long conversations

Poor context design leads to degraded reasoning performance.
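One common policy for these decisions is recency-based trimming. As a minimal sketch, assuming character counts as a stand-in for token counts:

```typescript
// Naive context-window management: keep the most recent messages
// whose combined length fits a fixed budget.
function trimHistory(messages: string[], budget: number): string[] {
  const kept: string[] = [];
  let used = 0;
  // walk backwards so the newest messages survive
  for (let i = messages.length - 1; i >= 0; i--) {
    if (used + messages[i].length > budget) break;
    kept.unshift(messages[i]);
    used += messages[i].length;
  }
  return kept;
}
```

Real systems refine this with summarization: instead of dropping old messages outright, they compress them into a short summary that stays in context.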

Memory

Memory allows the agent to persist knowledge across interactions.

Without memory, every request becomes a stateless query.

There are two main categories of memory.

Short-Term Memory

This is usually the conversation history stored in the prompt.

Example:

User: What is RAG?
Agent: RAG stands for Retrieval Augmented Generation.
User: How does it work?

The model understands "How does it work?" because the previous conversation is still in context. However, this memory disappears when:

  • the session ends

  • the server restarts

  • the context window overflows

Long-Term Memory

Long-term memory requires external storage. Common approaches include:

  • Redis (Cache Memory)

  • PostgreSQL (SQL DB)

  • MongoDB (NoSQL DB)

  • vector databases

Long-term memory enables:

  • user personalization

  • task continuation

  • persistent knowledge accumulation

For production agents, long-term memory is essential.
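The shape of a long-term memory store can be sketched with an in-process `Map` standing in for Redis or PostgreSQL. The `remember`/`recall` interface is hypothetical, not from any framework:

```typescript
// Long-term memory sketch: a Map stands in for an external store.
// In production the same interface would be backed by Redis,
// PostgreSQL, or a vector database so state survives restarts.
class LongTermMemory {
  private store = new Map<string, string[]>();

  remember(userId: string, fact: string): void {
    const facts = this.store.get(userId) ?? [];
    facts.push(fact);
    this.store.set(userId, facts);
  }

  recall(userId: string): string[] {
    return this.store.get(userId) ?? [];
  }
}
```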

3. Architecture: The Two-Phase Workflow

Agentic RAG systems are typically built around two decoupled pipelines:

  1. Ingestion Pipeline

  2. Execution Pipeline

Separating these pipelines improves scalability and maintainability.

Phase I: The Ingestion Pipeline

The ingestion pipeline prepares private data so it can be retrieved by the agent.

The goal is to transform human-readable text into vector representations that can be searched mathematically.

PDF Parsing

Most enterprise data exists in unstructured formats such as:

  • PDFs

  • Word documents

  • emails

  • reports

The first step is extracting clean text.

Libraries often used include:

  • PDF parsing tools

  • document loaders

  • OCR pipelines for scanned documents

Challenges include:

  • inconsistent formatting

  • embedded tables

  • broken line structures

  • scanned images

Preprocessing steps may involve:

  • removing headers and footers

  • normalizing whitespace

  • reconstructing paragraphs

Recursive Chunking

LLMs cannot process entire documents efficiently. Instead, documents are divided into smaller chunks. Chunking balances two competing goals:

  • preserving semantic meaning

  • staying within token limits

A common chunk size is 1000 characters. This size works well because it is large enough to preserve meaning but small enough for embedding models. Recursive chunking ensures that:

  • paragraphs stay intact

  • sentences are not broken unnecessarily

The Overlap Strategy

Chunk boundaries often cut off sentences or ideas. To avoid losing context, chunks overlap. Example:

Chunk 1: characters 0–1000
Chunk 2: characters 800–1800

The overlap size in this implementation is 200 characters. This ensures that concepts appearing near chunk boundaries remain searchable. Without overlap, important context may be lost.
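The overlap mechanics can be sketched as a sliding window over the text, using the sizes described above (1000-character chunks, 200-character overlap). A real recursive splitter, such as LangChain's `RecursiveCharacterTextSplitter`, additionally respects paragraph and sentence boundaries; this sketch shows only the window arithmetic:

```typescript
// Sliding-window chunker: each chunk is chunkSize characters long and
// the window advances by (chunkSize - overlap), so consecutive chunks
// share `overlap` characters.
function chunkText(text: string, chunkSize = 1000, overlap = 200): string[] {
  const chunks: string[] = [];
  const step = chunkSize - overlap; // advance 800 characters per chunk
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last chunk reached the end
  }
  return chunks;
}
```

On an 1800-character document this produces exactly the two chunks from the example: characters 0–1000 and 800–1800.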

Embedding Generation

Each chunk is converted into a vector embedding. Embeddings represent semantic meaning in a high-dimensional space.

For example, "The capital of France is Paris" might become a vector like [0.12, -0.44, 0.89, ...]. Similar sentences generate vectors that are close together in vector space. This allows the system to perform semantic similarity search instead of keyword matching. Common embedding models include:

  • OpenAI embedding models

  • Cohere embeddings

  • open-source models

In this system, the embedding model used is llama-text-embed-v2.
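The "closeness" that makes semantic search work is typically cosine similarity. As a dependency-free sketch with a toy in-memory index standing in for what Pinecone does at scale:

```typescript
// Cosine similarity: 1 for identical directions, 0 for orthogonal vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Brute-force nearest-neighbour search over an in-memory "index":
// returns the indices of the k most similar vectors.
function topK(query: number[], index: number[][], k: number): number[] {
  return index
    .map((v, i) => ({ i, score: cosine(query, v) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((r) => r.i);
}
```

Production vector databases replace the brute-force scan with approximate nearest-neighbour indexes (e.g. HNSW) so queries stay fast at millions of vectors.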

Vector Upserting

Once embeddings are generated, they must be stored in a vector database. The database indexes vectors so they can be searched efficiently. In this architecture, Pinecone is used. Each record stored in the vector database includes:

  • embedding vector

  • document text

  • metadata

  • document ID

Example metadata:

{
  "source": "research_paper.pdf",
  "page": 12
}

Batch Upload Strategy

Vector databases often have API limits. To optimize ingestion throughput, embeddings are uploaded in batches. The recommended batch size is 96 chunks. Batching improves:

  • network efficiency

  • ingestion speed

  • API stability
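The batching step itself is simple array slicing. A minimal sketch, using the batch size of 96 described above:

```typescript
// Split a list of records into fixed-size batches so each upsert
// request stays under the vector database's per-request limit.
function toBatches<T>(items: T[], size = 96): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}
```

Each batch would then be sent as a single upsert call; adding retry with backoff around that call further improves API stability.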

Phase II: The Reasoning Loop (Execution)

Traditional RAG pipelines follow a simple linear flow:

User Query → Retrieve Documents → Generate Answer

Agentic RAG replaces this with an iterative reasoning loop based on the ReAct pattern.

ReAct stands for: Reasoning + Acting

Step 1: Reasoning

The agent first analyzes the user query. Example:

"Explain the difference between RAG and Agentic RAG."

The LLM determines:

  • whether it already knows the answer

  • whether external information is required

  • which tools should be used

Step 2: Action

If external data is needed, the agent invokes a tool.

Example tool:

similarity_search(query)

The query is embedded and sent to the vector database.

Step 3: Retrieval

The vector database performs a similarity search and returns the top K = 10 results. These results contain the most semantically relevant chunks.

Step 4: Observation

The agent receives the retrieved documents. It evaluates:

  • whether the documents answer the question

  • whether additional retrieval is needed

This step allows multi-hop reasoning.

Step 5: Response Generation

Finally, the agent synthesizes the retrieved information into a response. The response is grounded in retrieved data, reducing hallucinations.
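The five steps above can be sketched as a loop. The `decide` function and the tool are stubbed stand-ins for real model and database calls; the structure of the loop is the point:

```typescript
// Skeleton of the ReAct loop. At each step the "LLM" either acts
// (calls a tool) or answers. A step cap prevents infinite loops.
type Decision =
  | { kind: "act"; tool: string; input: string }
  | { kind: "answer"; text: string };

function runAgent(
  query: string,
  decide: (query: string, observations: string[]) => Decision,
  tools: Record<string, (input: string) => string>,
  maxSteps = 5
): string {
  const observations: string[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const d = decide(query, observations);   // Step 1: Reasoning
    if (d.kind === "answer") return d.text;  // Step 5: Response generation
    const result = tools[d.tool](d.input);   // Steps 2-3: Action + Retrieval
    observations.push(result);               // Step 4: Observation
  }
  return "max steps reached";
}
```

Because observations feed back into `decide`, the loop naturally supports multi-hop reasoning: the agent can retrieve, inspect the result, and retrieve again before answering.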

4. The Technical Stack

A production agentic system requires several specialized components.

Orchestrator: LangChain

LangChain coordinates the reasoning loop. It manages:

  • prompt templates

  • tool invocation

  • conversation memory

  • agent workflows

LangChain effectively acts as the control plane of the agent.

Vector Database: Pinecone

Vector databases enable efficient similarity search. Their main responsibilities include:

  • storing embeddings

  • indexing vectors

  • returning nearest neighbors

Performance metrics include:

  • query latency

  • recall accuracy

  • scalability

Observability: LangSmith

Observability is critical for debugging agent behavior.

LangSmith allows developers to track:

  • token usage

  • request latency

  • tool calls

  • reasoning chains

Without observability, debugging agents becomes extremely difficult.

Tool Schemas: Zod

Agents need structured tool definitions. Zod is used to enforce strict schemas.

Example schema:

query: string

This tells the LLM exactly what input format the tool expects. Schema validation prevents malformed tool calls.
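With Zod itself, the schema above would be written as `z.object({ query: z.string() })` and enforced with `.parse()`. As a dependency-free sketch of the check such a schema performs at runtime:

```typescript
// Minimal stand-in for what a Zod schema like
// z.object({ query: z.string() }) enforces: reject tool calls
// whose arguments don't match the declared shape.
function validateSearchInput(input: unknown): { query: string } {
  if (
    typeof input !== "object" || input === null ||
    typeof (input as { query?: unknown }).query !== "string"
  ) {
    throw new Error("malformed tool call: expected { query: string }");
  }
  return input as { query: string };
}
```

Validating at the boundary means a malformed tool call fails fast with a clear error, instead of propagating bad input into the retrieval pipeline.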

5. Lessons from the Field: Memory and Efficiency

One of the most interesting behaviors in agent systems is context reuse. During testing, the agent was asked the same question twice.

First query:

Agent → calls vector search

Second query:

Agent → skips search

The reason is simple. The answer already exists in the conversation history. This demonstrates how memory can significantly reduce:

  • token usage

  • database queries

  • latency

However, conversation memory alone is insufficient for real systems.

The Need for Persistent Memory

Without external storage, agent memory is ephemeral. It disappears when:

  • sessions expire

  • containers restart

  • servers redeploy

Production systems therefore integrate memory databases. Common options include:

  • PostgreSQL

  • Redis

  • vector-based memory stores

Persistent memory enables:

  • long-term user profiles

  • historical context

  • adaptive agent behavior

6. The Future: MCP and Standardized Agent Interfaces

The current agent ecosystem is fragmented. Each framework implements its own tool integration system. An emerging standard aims to fix this problem: the Model Context Protocol (MCP).

MCP defines a universal interface between:

  • LLMs

  • tools

  • data systems

Instead of writing custom integrations for each API, MCP allows tools to expose standardized capabilities. Examples include:

  • file systems

  • knowledge bases

  • databases

  • development environments

This could eventually allow agents to interact with software ecosystems in a far more modular way. Rather than tightly coupled pipelines, future systems may consist of plug-and-play AI components.

Conclusion: From Tools to Autonomous Systems

Artificial intelligence is rapidly shifting from a tool-based paradigm to a system-based paradigm. Early adoption focused on interacting with AI through prompts and interfaces. Today, the frontier lies in building agentic architectures that combine reasoning, memory, and external capabilities to perform complex tasks autonomously.

Understanding the AI Pyramid helps developers position themselves within this evolving ecosystem. While AI Literate users focus on productivity and AI Native researchers push the boundaries of model development, the most transformative work today happens in the AI Enabled layer where engineers design systems that orchestrate LLMs, tools, and data.

Agentic RAG represents one of the most powerful architectural patterns emerging from this shift. By combining retrieval pipelines, reasoning loops, and structured tool execution, developers can create systems that are not only informative but adaptive, context-aware, and capable of acting.

However, building production-grade agents requires more than simply connecting APIs. Engineers must carefully design:

  • ingestion pipelines for knowledge grounding

  • reasoning workflows for tool selection

  • memory architectures for persistence

  • observability systems for debugging and cost control

As standards like the Model Context Protocol (MCP) mature and agent frameworks evolve, we are likely moving toward a future where AI systems behave less like static models and more like software entities capable of interacting with entire digital ecosystems.

For modern developers, the challenge is no longer learning how to use AI, but learning how to architect intelligent systems around it. Those who master this shift will define the next generation of software.