Hey everyone, Alex here from agntai.net. It’s May 11th, 2026, and I’ve been wrestling with something that I think a lot of you working with AI agents are probably feeling right now: the sheer weight of context windows. Specifically, I’m talking about how we can build agents that are genuinely smart and effective without constantly hitting token limits or paying an arm and a leg for every interaction. Forget the hype about “infinite context” – that’s a mirage. We need practical strategies, and that’s what I want to dig into today.
My focus for this article is on an architecture I’ve been refining for building more capable AI agents, particularly those that need to perform complex, multi-step tasks over extended periods. I call it the “Hierarchical Context Manager” (HCM) architecture. It’s not about inventing a new LLM; it’s about how we *use* the ones we have, smarter.
The Context Conundrum: My Personal Battle
Let me set the scene. A few months ago, I was working on an agent for a client. This agent’s job was to assist junior analysts in synthesizing information from multiple internal reports, external news feeds, and competitor analyses. The goal was to generate a concise, actionable summary and identify key trends. Sounds straightforward, right?
Initially, I tried the naive approach: dump all relevant documents into the prompt and ask the LLM to do everything in one shot. Predictably, it failed spectacularly. Token limits were a constant headache. Even when I chunked documents, the agent struggled to maintain coherence across different information sources. It would summarize one part beautifully, then totally miss connections to another. The cost per interaction was also spiraling. I was basically paying for the LLM to re-read everything every time it needed to make a decision or generate an output.
This frustration led me down a rabbit hole. I realized the problem wasn’t the LLM’s intelligence in isolation, but its working memory – its context window. It’s like asking a human to write a detailed report after only allowing them to read one page at a time, then making them forget everything they just read before moving to the next page. Absurd! We needed a system that mimicked how humans manage information: abstracting, remembering key points, and only diving into details when necessary.
Introducing the Hierarchical Context Manager (HCM) Architecture
The core idea behind HCM is to create a tiered system of information management for your agent. Instead of a flat, monolithic context, we organize it into levels of abstraction and relevance. This allows the agent to maintain a high-level understanding of its task and accumulated knowledge, while still being able to pull in specific details when required, without overwhelming the LLM’s context window.
Level 1: The Ephemeral Scratchpad (Short-Term Memory)
This is where the agent does its immediate thinking. It’s the equivalent of a human’s working memory. When the agent receives a new prompt, or is in the middle of a step in a multi-step task, relevant information for *that specific step* is loaded here. This could be the current sub-goal, the output of the previous step, or a small chunk of data it just retrieved.
The key here is that this context is highly dynamic and transient. It’s purpose-built for the current LLM call. Once the LLM has processed this information and produced an output, this scratchpad is largely cleared, with only the most crucial elements being promoted to a higher level of memory.
Think of it as the input to a single function call. It’s focused and small.
Level 2: The Working Log (Intermediate-Term Memory)
This level stores a structured, chronological log of the agent’s recent activities, decisions, and observations. It’s more persistent than the scratchpad but still focused on the current “session” or task execution. Instead of raw LLM outputs, we store summarized versions, key facts extracted, or decisions made.
For my analyst agent, this might include:
- “Step 1 completed: Identified 3 relevant internal reports.”
- “Decision: Prioritize reports from Q4 2025 due to recent market shift.”
- “Observation: Competitor X’s Q1 2026 earnings showed unexpected growth in AI investments.”
This log acts as a rolling summary of progress. When the agent needs to decide its next step, or respond to a new user query, it can consult this log without needing to re-read every single raw piece of data it has processed.
I typically implement this as a simple list of dictionaries or a small database table, where each entry is a summary generated by a smaller, cheaper LLM call, or even a rule-based system, based on the output of the Ephemeral Scratchpad.
Level 3: The Knowledge Base (Long-Term Memory)
This is where the “heavy lifting” of information storage happens. It’s a persistent, queryable repository of all the knowledge the agent has access to or has accumulated over time. This includes:
- Original documents (reports, articles, emails)
- Summaries of these documents (generated by the agent itself)
- Extracted entities, facts, and relationships
- User preferences or historical interactions
- Domain-specific rules or heuristics
The Knowledge Base isn’t directly loaded into the LLM’s context. Instead, it’s accessed via retrieval augmented generation (RAG). When the agent at Level 1 or 2 determines it needs specific information, it formulates a query, sends it to the Knowledge Base, and retrieves only the most relevant chunks. This is crucial for keeping context windows manageable.
For my client’s agent, this would be where all the internal reports, competitor analyses, and news feeds live. When the agent needed to find “details on Competitor Y’s market share in Q1 2026”, it wouldn’t load *all* competitor data; it would query the KB for that specific piece of information.
How the HCM Architecture Works in Practice
Let’s trace a typical interaction with an agent built using HCM:
1. Initial Prompt: User asks, "Summarize market trends for Q1 2026 for our AI investments and identify key competitors." This prompt goes into the Ephemeral Scratchpad, along with the current high-level goal.
2. Planning (LLM Call 1): The LLM, given the prompt and a small amount of high-level context from the Working Log (e.g., "current task is market analysis"), decides on a plan:
   - Retrieve Q1 2026 market reports.
   - Identify key AI investment areas.
   - Identify top competitors in those areas.
   - Synthesize findings.
   This plan is recorded in the Working Log (summarized).
3. Information Retrieval (Knowledge Base Query): The agent executes the first step: "Retrieve Q1 2026 market reports." It formulates a query for the Knowledge Base (e.g., "documents related to 'Q1 2026 market trends AI'"). The Knowledge Base returns relevant document chunks. These chunks are *not* immediately loaded into the main LLM context.
4. Processing Retrieved Data (LLM Calls 2 through N): For each retrieved chunk, the agent might perform a smaller, focused LLM call (using the Ephemeral Scratchpad) to:
   - Extract key facts.
   - Summarize the chunk.
   - Identify entities (companies, technologies).
   These extracted facts and summaries are then added to the Working Log (or fed back into the Knowledge Base as summarized versions). Example: a document chunk mentions "XYZ Corp investing heavily in neuromorphic chips." That fact is extracted and added to the Working Log.
5. Synthesis and Decision Making (LLM Call N+1): Once enough information has been processed and logged, the agent's main LLM (with the Ephemeral Scratchpad containing the current sub-goal and a condensed summary from the Working Log) is prompted to synthesize the findings or decide the next step. The prompt might look something like: "Given the following observations from the Working Log: [list of summarized facts], what are the key AI investment trends for Q1 2026? What competitors are prominent?"
6. Final Output: The LLM generates the final summary and identifies competitors. This is presented to the user, and a summary of the entire interaction is stored in the Knowledge Base for future reference.
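To make the flow concrete, the trace above can be sketched as a single orchestration loop. This is a minimal illustration only: `call_llm` is a stub standing in for a real LLM API client, and a plain dict stands in for the Knowledge Base lookup; the function and variable names are mine, not a fixed API.

```python
def call_llm(prompt: str) -> str:
    """Stub standing in for a real LLM API call."""
    return f"[response to: {prompt[:30]}...]"

def run_task(user_prompt, scratchpad, working_log, knowledge_base):
    # 1. Planning: the LLM sees the prompt plus a few recent log summaries,
    #    never the full raw history.
    scratchpad["goal"] = user_prompt
    scratchpad["recent_log"] = working_log[-3:]
    plan = call_llm(f"Plan steps for: {user_prompt}\nContext: {scratchpad['recent_log']}")
    working_log.append(f"PLAN: {plan}")

    # 2. Retrieval and per-chunk processing: each chunk gets its own focused
    #    call; only the extracted summary is promoted to the Working Log.
    for chunk in knowledge_base.get(user_prompt, []):
        fact = call_llm(f"Extract key facts from: {chunk}")
        working_log.append(f"OBSERVATION: {fact}")

    # 3. Synthesis: the final call sees only the condensed log, not raw chunks.
    answer = call_llm(f"Synthesize findings from: {working_log}")
    scratchpad.clear()  # Ephemeral context is discarded once the task ends.
    return answer

# Usage example:
kb = {"Q1 2026 AI trends": ["chunk on AI infra growth", "chunk on XYZ Corp"]}
log = []
result = run_task("Q1 2026 AI trends", {}, log, kb)
```

The key property to notice: the raw chunks never reach the synthesis call, only their one-line summaries in the Working Log do.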
Practical Implementation Details and Code Snippets
Implementing HCM doesn’t require exotic libraries, just careful thought about data flow. Here’s a simplified Python example of how you might structure parts of this:
1. The Ephemeral Scratchpad (Python Dictionary)
```python
class EphemeralScratchpad:
    def __init__(self):
        self.context = {}

    def add_item(self, key, value):
        self.context[key] = value

    def get_context_string(self):
        # Format for LLM prompt
        return "\n".join([f"{k}: {v}" for k, v in self.context.items()])

    def clear(self):
        self.context = {}

# Usage example:
scratchpad = EphemeralScratchpad()
scratchpad.add_item("current_task", "Identify competitor X's Q1 2026 AI strategy.")
scratchpad.add_item("previous_output", "Competitor X's earnings report shows increased R&D spend.")
# LLM call will use scratchpad.get_context_string()
```
2. The Working Log (Simple List of Dicts, or SQLite)
```python
import sqlite3
import json
from datetime import datetime

class WorkingLog:
    def __init__(self, db_path='agent_log.db'):
        self.conn = sqlite3.connect(db_path)
        self.cursor = self.conn.cursor()
        self._create_table()

    def _create_table(self):
        self.cursor.execute('''
            CREATE TABLE IF NOT EXISTS log_entries (
                id INTEGER PRIMARY KEY,
                timestamp TEXT,
                event_type TEXT,
                summary TEXT,
                details TEXT
            )
        ''')
        self.conn.commit()

    def add_entry(self, event_type, summary, details=None):
        timestamp = datetime.now().isoformat()
        details_json = json.dumps(details) if details else None
        self.cursor.execute('''
            INSERT INTO log_entries (timestamp, event_type, summary, details)
            VALUES (?, ?, ?, ?)
        ''', (timestamp, event_type, summary, details_json))
        self.conn.commit()

    def get_recent_summaries(self, limit=5):
        # Retrieve the most recent N summaries for LLM context
        self.cursor.execute('''
            SELECT summary FROM log_entries ORDER BY timestamp DESC LIMIT ?
        ''', (limit,))
        return [row[0] for row in self.cursor.fetchall()]

    def close(self):
        self.conn.close()

# Usage example:
log = WorkingLog()
log.add_entry("PLAN", "Decided to first retrieve Q1 2026 market reports.",
              {"plan_steps": ["retrieve reports", "analyze trends"]})
log.add_entry("OBSERVATION", "Found a key report indicating 15% growth in AI infrastructure.",
              {"report_id": "RPT-001"})

# For an LLM call needing recent context:
recent_log_entries = log.get_recent_summaries(3)
# -> ["Found a key report indicating...", "Decided to first retrieve...", ...]
```
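Since the log is meant to act as a rolling summary, it also shouldn't grow without bound. One option, sketched here as a standalone illustration (not part of the `WorkingLog` class above), is to periodically collapse older entries into a single digest; the `summarize` stub stands in for the cheap LLM or rule-based summarization step mentioned earlier.

```python
def summarize(entries):
    # Stub: in practice, a small/cheap LLM call would write this digest.
    return "DIGEST: " + "; ".join(e[:40] for e in entries)

def compact_log(entries, keep_recent=5):
    """Collapse all but the newest `keep_recent` entries into one digest entry."""
    if len(entries) <= keep_recent:
        return entries
    older, recent = entries[:-keep_recent], entries[-keep_recent:]
    return [summarize(older)] + recent

# Usage example:
entries = [f"Step {i} completed." for i in range(10)]
compacted = compact_log(entries, keep_recent=3)
# compacted holds one digest entry followed by the 3 newest entries
```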
3. The Knowledge Base (Vector Database + Text Chunks)
This is where tools like ChromaDB, Pinecone, or even a simple FAISS index come in. I won’t provide a full vector DB setup here, as it’s a topic in itself, but the concept is to embed your document chunks and query them semantically.
```python
# Conceptual example for Knowledge Base interaction
from some_vector_db_library import VectorDBClient  # e.g., Chroma, Pinecone
from some_embedding_model import get_embedding     # e.g., OpenAI, Sentence Transformers

class KnowledgeBase:
    def __init__(self, db_client):
        self.db_client = db_client  # Assumes an initialized vector DB client

    def add_document(self, doc_id, text_content):
        # Chunk text, embed chunks, store in vector DB
        chunks = self._chunk_text(text_content)
        embeddings = [get_embedding(chunk) for chunk in chunks]
        self.db_client.add(
            ids=[f"{doc_id}-{i}" for i in range(len(chunks))],
            embeddings=embeddings,
            metadatas=[{"doc_id": doc_id, "chunk_idx": i} for i in range(len(chunks))],
            documents=chunks,
        )

    def query_knowledge(self, query_string, top_k=5):
        query_embedding = get_embedding(query_string)
        results = self.db_client.query(query_embeddings=[query_embedding], n_results=top_k)
        # Process results to return relevant text chunks
        return results['documents'][0]  # List of relevant text strings

    def _chunk_text(self, text):
        # Simple chunking for demonstration.
        # In reality, use more sophisticated chunking strategies (recursive, semantic).
        return [text[i:i+500] for i in range(0, len(text), 500)]

# Usage example:
# kb_client = VectorDBClient(...)  # Initialize your vector DB
# kb = KnowledgeBase(kb_client)
# kb.add_document("Q1_Report_AI", "Full text of Q1 AI market report...")
# relevant_info = kb.query_knowledge("What were the key AI investment areas in Q1 2026?")
# -> ["Chunk 1 from report...", "Chunk 2 from report...", ...]
```
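One easy upgrade over that naive `_chunk_text`: fixed-size slices can split a sentence right at a boundary, so a common improvement is to overlap consecutive chunks. Here's a small sketch; the chunk size and overlap values are illustrative, not tuned recommendations.

```python
def chunk_with_overlap(text, chunk_size=500, overlap=100):
    """Fixed-size chunks where consecutive chunks share `overlap` characters."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

# Usage example:
doc = "".join(str(i % 10) for i in range(1000))
chunks = chunk_with_overlap(doc)
# Each chunk's last 100 characters repeat as the next chunk's first 100,
# so text split at a boundary still appears intact in one of the chunks.
```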
Challenges and Considerations
- Orchestration Complexity: Managing these layers requires a good orchestrator (your agent’s main loop). It needs to know when to query the KB, when to update the Working Log, and what to put into the Ephemeral Scratchpad for each LLM call.
- Summarization Quality: The effectiveness of the Working Log depends heavily on how well you summarize information before storing it. Bad summaries lead to bad decisions. You might need fine-tuned smaller LLMs or clever prompt engineering for this.
- Retrieval Accuracy: Your RAG system for the Knowledge Base needs to be good. If it retrieves irrelevant information, the agent will still be confused or miss crucial details. Experiment with different embedding models, chunking strategies, and similarity metrics.
- Cost Management: While HCM aims to reduce costs, there’s still a balance. Many small LLM calls (for summarization, extraction) can add up. Choose appropriate models (e.g., cheaper, smaller models for simple summarization; larger, more capable models for complex reasoning).
Actionable Takeaways for Your Next Agent Project
If you’re building an AI agent that needs to handle more than a single turn of interaction or deal with significant amounts of information, I strongly recommend adopting a tiered context management strategy like HCM. Here’s what you can do:
- Map Your Agent’s “Thinking Process”: Before writing code, outline the steps your agent needs to take for its primary tasks. Where does it need immediate info? Where does it need to remember past steps? Where does it need access to a large corpus of data?
- Implement a “Working Log” First: This is arguably the most impactful first step. A simple list of summarized observations and decisions can dramatically improve an agent’s ability to maintain coherence over multiple steps without blowing up your main context window.
- Adopt RAG for Long-Term Data: Don’t try to cram entire documents into your LLM. Build a robust retrieval system for your Knowledge Base. This will be a continuous effort of refining chunking, embedding, and querying.
- Think About Abstraction: Train or prompt your LLM to produce concise summaries or extracted facts rather than just raw outputs. This is how you distill information for higher-level memory layers.
- Monitor Token Usage and Cost: Keep an eye on your LLM API calls. HCM aims to reduce average token usage per call, leading to more efficient and cheaper operations. If you see spikes, investigate which layer is causing it and if information can be further abstracted.
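For that monitoring, even a crude per-layer estimate makes it obvious which tier is bloating. A minimal sketch, assuming a `call_llm` function you supply; the 4-characters-per-token heuristic is rough, and a real system should use the provider's actual tokenizer.

```python
from collections import defaultdict

usage = defaultdict(int)  # layer name -> approximate token count

def tracked_llm_call(layer, prompt, call_llm):
    """Wrap any LLM call and attribute its rough token usage to a layer."""
    usage[layer] += len(prompt) // 4   # crude chars-to-tokens estimate
    response = call_llm(prompt)
    usage[layer] += len(response) // 4
    return response

# Usage example with a stubbed LLM:
stub_llm = lambda p: "stubbed response " * 5
tracked_llm_call("scratchpad", "x" * 400, stub_llm)
# usage["scratchpad"] now holds the estimate for prompt plus response
```

Tagging every call with the layer that issued it ("scratchpad", "working_log_summary", "synthesis") turns a vague cost spike into a specific place to add more abstraction.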
The journey to building truly capable AI agents is less about finding the “perfect” LLM and more about engineering intelligent systems around them. The Hierarchical Context Manager architecture is my current best attempt at tackling the context problem head-on, giving our agents the kind of memory and information management skills they need to perform in the real world. Give it a shot, and let me know what you find!