Alright, folks, Alex here, back at agntai.net, and today we’re diving headfirst into something that’s been buzzing in my own dev environment for weeks: the surprisingly complex dance of state management in multi-agent systems. Not just any multi-agent systems, mind you, but those where agents are actually *doing things* in the real world, or at least in a simulated one that mimics reality pretty closely. We’re talking about agents that need to remember more than just their last LLM call; they need to remember what they *did*, what *happened* because of it, and what that means for their next action. It’s a mess, and it’s glorious.
My recent obsession started with a simple idea: build a team of “digital assistants” for a small, fictional e-commerce store. One agent handles customer support (parsing emails, looking up orders), another manages inventory (checking stock levels, flagging reorders), and a third handles marketing (drafting social media posts based on new products or sales). Sounds simple, right? Just hook ’em up to an LLM, give ’em some tools, and let ’em rip. Oh, my sweet summer child, I was so naive.
## The State of My Own Sanity: Why Simple State Isn’t Enough
Initially, I thought, “Each agent has its own memory, right? Just a conversational buffer, maybe a scratchpad for tools.” And for a single agent doing a single task, that’s often fine. The customer support agent can remember the current customer’s query. The inventory agent knows what it just reordered. But what happens when the customer support agent promises a refund, and the inventory agent needs to know that a specific item is now considered “returned” and shouldn’t be counted as available? Or when the marketing agent drafts a “New Product X!” post, but the inventory agent just flagged Product X as out of stock?
Suddenly, my agents weren’t just conversing; they were *interacting* with a shared, dynamic world. And that shared world needed a shared understanding, a shared *state*. This isn’t just about passing messages back and forth; it’s about a consistent, verifiable record of what’s true in the system at any given moment.
## The Pitfalls of Ad-Hoc State Sharing
My first attempt was, predictably, a disaster. I tried to just have agents “tell” each other things. The customer support agent, after processing a refund, would just send a message like “Refund processed for Order #123, Item SKU: XYZ.” The inventory agent would then, in theory, pick that up and update its internal representation. This led to:
- Race Conditions Galore: What if the marketing agent checked stock *before* the inventory agent processed the refund message? Outdated information.
- Lost Messages: What if an agent was busy or crashed? The message was just… gone.
- Conflicting Information: Agent A says X is true, Agent B says Y is true, and both rely on old data.
- Debugging Nightmares: Trying to trace why a specific piece of information was wrong was like trying to find a needle in a haystack made of LLM outputs.
It quickly became clear that the “just tell each other” approach was a recipe for chaos. These weren’t just LLM calls; these were *transactions*. They had consequences, and those consequences needed to be reflected consistently across the system.
## Enter the Shared, Event-Driven World State
This is where I started thinking about a more centralized, but still decoupled, approach. The core idea is to treat the system’s “truth” as a series of events that happen in the world. Agents don’t just “tell” each other things; they *publish* events about changes they’ve observed or actions they’ve taken. Other agents *subscribe* to these events if they care about them.
Think of it like a news ticker for your agent system. When something important happens – an order is placed, an item is refunded, stock levels change – it gets broadcast. Agents who are “tuned in” to that specific news channel react accordingly.
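To make that “news ticker” concrete, here’s a deliberately tiny in-process sketch of such a bus. The `EventBus` class and its method names are my own illustration, not any particular library’s API; in a real system you’d reach for Kafka, RabbitMQ, or Redis Pub/Sub instead.

```python
from collections import defaultdict

class EventBus:
    """A minimal in-process pub/sub bus, for illustration only."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        # An agent "tunes in" to a news channel it cares about
        self._subscribers[topic].append(callback)

    def publish(self, topic, event_data):
        # Fan the event out to everyone subscribed to this topic
        for callback in self._subscribers[topic]:
            callback(event_data)
```

A real broker adds persistence, ordering, and delivery guarantees on top of this shape, but the publish/subscribe contract the agents see is essentially the same.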
## Practical Example: The Order State Machine
Let’s take the e-commerce example. Instead of agents directly modifying shared data structures (which is a big no-no in distributed systems anyway), I introduced a central “Order State” service. This service is the single source of truth for all order-related data. Agents interact with it through defined APIs or by publishing events that *trigger* state changes in this service.
Here’s a simplified breakdown of how an order might flow:
```python
# Simplified event definitions (Python dataclasses; Pydantic models work too)
from dataclasses import dataclass, asdict
from datetime import datetime
from typing import List

@dataclass
class OrderPlacedEvent:
    order_id: str
    customer_id: str
    items: List[dict]  # e.g., [{"sku": "XYZ", "qty": 1}]
    timestamp: datetime

@dataclass
class RefundRequestedEvent:
    order_id: str
    item_sku: str
    refund_amount: float
    timestamp: datetime

@dataclass
class InventoryUpdatedEvent:
    sku: str
    new_stock_level: int
    timestamp: datetime

# Imagine a central event bus or message queue (e.g., Kafka, RabbitMQ, or
# even simple Redis Pub/Sub). Agents publish events to this bus; other
# services/agents subscribe to the topics they care about.

# Example: Customer Support Agent publishes a refund request
class CustomerSupportAgent:
    def __init__(self, event_bus):
        self.event_bus = event_bus

    def process_refund_request(self, order_id, item_sku, amount):
        # ... logic to confirm refund eligibility ...
        refund_event = RefundRequestedEvent(
            order_id=order_id,
            item_sku=item_sku,
            refund_amount=amount,
            timestamp=datetime.now(),
        )
        # Tag the payload with its type so subscribers can dispatch on it
        payload = {"type": "RefundRequestedEvent", **asdict(refund_event)}
        self.event_bus.publish("order_events", payload)
        print(f"Agent published RefundRequestedEvent for Order {order_id}")

# Example: Inventory Service (not necessarily an agent, but a backend
# service) subscribes to refund events to update stock.
class InventoryService:
    def __init__(self, event_bus):
        self.event_bus = event_bus
        self.stock = {}  # sku -> current stock level
        self.event_bus.subscribe("order_events", self.handle_order_event)

    def handle_order_event(self, event_data):
        event_type = event_data.get("type")
        if event_type == "RefundRequestedEvent":
            # Strip the dispatch tag before rebuilding the dataclass
            fields = {k: v for k, v in event_data.items() if k != "type"}
            refund_event = RefundRequestedEvent(**fields)
            print(f"Inventory Service received RefundRequestedEvent "
                  f"for Order {refund_event.order_id}")
            # Put the refunded item back into stock, then broadcast the change
            new_level = self.stock.get(refund_event.item_sku, 0) + 1
            self.stock[refund_event.item_sku] = new_level
            update = InventoryUpdatedEvent(
                sku=refund_event.item_sku,
                new_stock_level=new_level,
                timestamp=datetime.now(),
            )
            self.event_bus.publish(
                "inventory_events",
                {"type": "InventoryUpdatedEvent", **asdict(update)},
            )
```

This setup decouples the agents significantly: the customer support agent doesn’t need to know *how* inventory is updated, only that a refund event needs to be broadcast.
The key here is that events are immutable records of *what happened*. They are facts. Services or agents can then react to these facts. This gives us:
- Auditability: Every change in the system is an event. You can replay the history of the system.
- Decoupling: Agents don’t need to know about the internal workings of other agents; they just need to know about the events they care about.
- Resilience: If an agent goes down, the events are still on the bus. When it comes back up, it can process them.
- Consistency: The “Order State” service, by being the single writer for order states, ensures consistency.
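To make that “single writer” idea concrete, here’s a minimal sketch of what an Order State service might enforce. The `OrderStateService` class, its status names, and its transition table are illustrative assumptions, not a real implementation; the point is that every state change goes through one validator, so no agent can push an order into an impossible state.

```python
class InvalidTransition(Exception):
    """Raised when an agent proposes a state change the machine forbids."""

class OrderStateService:
    # The only component allowed to mutate an order's status.
    # Maps current status -> set of statuses it may move to.
    TRANSITIONS = {
        "placed": {"paid", "cancelled"},
        "paid": {"shipped", "refunded"},
        "shipped": {"delivered", "refunded"},
        "delivered": {"refunded"},
    }

    def __init__(self):
        self._orders = {}  # order_id -> current status

    def place(self, order_id):
        self._orders[order_id] = "placed"

    def transition(self, order_id, new_status):
        current = self._orders[order_id]
        if new_status not in self.TRANSITIONS.get(current, set()):
            raise InvalidTransition(f"{current} -> {new_status}")
        self._orders[order_id] = new_status
        return new_status
```

In practice, each successful `transition` would also publish an event announcing the new status, so subscribers learn about the change without ever touching the order table themselves.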
## Agents as State Transformers
My agents, in this model, became less about direct action and more about observing the world (via events), reasoning about it, and then proposing *new events* or *actions* that, when processed, would change the world state. For example:
- Customer Support Agent: Observes a `CustomerQueryEvent`, uses its LLM and tools to determine a refund is needed, then publishes a `RefundRequestedEvent`.
- Inventory Agent: Observes a `RefundProcessedEvent` (published by the Inventory Service after processing the `RefundRequestedEvent`), updates its internal view of inventory, and might then publish an `InventoryUpdatedEvent` if stock changes significantly, or even a `LowStockAlertEvent` if a threshold is crossed.
- Marketing Agent: Subscribes to `NewProductEvent` and `LowStockAlertEvent`. If a new product arrives, it drafts a social media post. If a product goes low on stock, it might pause promotions for that item.
This is a subtle but powerful shift. The agents aren’t directly modifying the inventory database; they’re proposing changes that are then validated and applied by dedicated services (or other agents with specific permissions) which then publish events reflecting the new state.
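Here’s a small sketch of that observe-and-propose pattern: an agent that watches inventory events and, when a threshold is crossed, publishes a `LowStockAlertEvent` rather than touching the stock data itself. The `InventoryAgent` class, the topic names, and the threshold value are my own illustrative assumptions.

```python
LOW_STOCK_THRESHOLD = 5  # illustrative threshold, not a real business rule

class InventoryAgent:
    """Observes inventory events and proposes alerts; never writes stock."""

    def __init__(self, event_bus):
        self.event_bus = event_bus
        self.event_bus.subscribe("inventory_events", self.handle_inventory_event)

    def handle_inventory_event(self, event_data):
        if event_data.get("type") != "InventoryUpdatedEvent":
            return  # not our concern; ignore other event types
        if event_data["new_stock_level"] <= LOW_STOCK_THRESHOLD:
            # Propose a new fact about the world; downstream services
            # (or the marketing agent) decide what to do with it
            self.event_bus.publish("alert_events", {
                "type": "LowStockAlertEvent",
                "sku": event_data["sku"],
                "stock_level": event_data["new_stock_level"],
            })
```

The agent’s only output is another event; whether promotions actually get paused is someone else’s validated decision.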
## The Challenge: LLM Context and State
One of the biggest headaches I ran into was managing the LLM’s context within this event-driven architecture. An LLM’s context window is finite, and just feeding it *all* the events is a non-starter. My solution involved a two-pronged approach:
- Agent-Specific Knowledge Bases: Each agent maintains its own summarized, relevant knowledge base derived from events. For instance, the customer support agent might have a “Customer History” that summarizes past interactions and order details relevant to a specific customer, rather than replaying every single order event. This often involves embedding and retrieval.
- Selective Event Feeding: When an agent needs to make a decision, it doesn’t get *all* events. It gets the immediate triggering event (e.g., a new customer query) and then, based on its internal reasoning, *retrieves* relevant past events or summarized state from its knowledge base.
Here’s a conceptual snippet for how an agent might fetch relevant context:
```python
# Conceptual - assumes a vector database or similar for knowledge retrieval
class SmartAgent:
    def __init__(self, event_bus, knowledge_store):
        self.event_bus = event_bus
        self.knowledge_store = knowledge_store  # e.g., a ChromaDB collection
        self.event_bus.subscribe("agent_triggers", self.handle_trigger_event)

    def handle_trigger_event(self, trigger_event_data):
        # ... parse trigger_event_data ...
        customer_id = trigger_event_data.get("customer_id")
        query_text = trigger_event_data.get("query")

        # Retrieve relevant past interactions/order summaries for this
        # customer -- this is where the magic of RAG comes in
        relevant_context = self.knowledge_store.query(
            query_texts=[query_text],
            n_results=5,
            where={"customer_id": customer_id},  # metadata filter
        )

        # Format the retrieved context for the LLM
        context_for_llm = self.format_context_for_llm(relevant_context)

        # Combine the current query and the retrieved context for the LLM call
        llm_prompt = (
            f"Customer Query: {query_text}\n\n"
            f"Relevant History:\n{context_for_llm}\n\nYour task..."
        )
        # ... call LLM, get response, publish new events ...

    def format_context_for_llm(self, retrieved_docs):
        # Simple formatting; could be much more sophisticated
        formatted = ""
        for doc in retrieved_docs:
            formatted += f"- {doc['content']}\n"  # assumes a 'content' field
        return formatted
```
This way, the LLM isn’t drowning in data, and the agents maintain a focused, relevant understanding of the world state without needing to store everything in their immediate working memory.
## Actionable Takeaways for Your Own Agent Architectures
If you’re building multi-agent systems that need to interact with a dynamic environment, don’t make my initial mistakes. Here’s what I’ve learned:
- Embrace Event-Driven Architecture: Treat changes in your system as immutable events. Use a message queue or event bus. This provides auditability, resilience, and decoupling.
- Define Clear State Ownership: Identify single sources of truth for critical data (e.g., “Order State,” “Inventory Levels”). Agents should request changes or publish events, not directly modify these states without validation.
- Agents as State Observers/Proposers: Your agents should primarily observe events, reason, and then propose new actions or events that, when processed by the appropriate service/agent, lead to state changes.
- Smart Context Management for LLMs: Don’t dump everything into the LLM’s context. Implement retrieval-augmented generation (RAG) strategies to pull *relevant* historical events or summarized state for each LLM call.
- Think Transactions, Not Just Conversations: If an agent’s action has real-world consequences (even simulated ones), treat it as a transaction that needs to be consistent and verifiable.
- Start Simple, Iterate: My initial simple approach failed, but it taught me *why* I needed a more robust system. Don’t over-engineer from day one, but be prepared to refactor your state management as complexity grows.
Building these systems is still more art than science, but moving from a “tell-and-hope” state management strategy to a robust, event-driven approach has been a game-changer for my multi-agent endeavors. It’s more work upfront, for sure, but the headache it saves you down the line is absolutely worth it. Happy building!
đź•’ Published:
Related Articles
- Quando l’IA incontra l’astrofotografia: un caso curioso del mio lavoro in ‘Project Hail Mary’
- Architettura del Trasformatore Approfondimento: Intuizioni sull’Ingegneria ML
- <translation>Valutazione degli agenti ben fatta: consigli pratici e riflessioni</translation>
- Modelloptimierung richtig gemacht: Kein Schnickschnack, nur Fakten