Applying Nine Optimization Techniques for Improved AI Memory Performance
One way to optimize an AI agent is to design its architecture with multiple sub-agents to improve accuracy. However, in conversational AI, optimization doesn't stop there; memory becomes even more crucial.
As your conversation with the AI agent gets longer and deeper, it uses more memory.
This is due to components like previous context storage, tool calling, database searches, and other dependencies your AI agent relies on.
In this blog, we will code and evaluate 9 beginner-to-advanced memory optimization techniques for AI agents.
You will learn how to apply each technique, along with its advantages and drawbacks, from simple sequential approaches to advanced, OS-like memory management implementations.
Summary of Techniques
To keep things clear and practical, we will use a simple AI agent throughout the blog. This will help us observe the internal mechanics of each technique and make it easier to scale and implement these strategies in more complex systems.
All the code (theory + notebook) is available in my GitHub repo:
Setting up the Environment
To optimize and test different memory techniques for AI agents, we need to initialize several components before starting the evaluation. But before initializing, we first need to install the necessary Python libraries.
We will need:
openai: The client library for interacting with the LLM API.
numpy: For numerical operations, especially with embeddings.
faiss-cpu: A library from Facebook AI for efficient similarity search, which will power our retrieval memory. It's a perfect in-memory vector database.
networkx: For creating and managing the knowledge graph in our Graph-Based Memory strategy.
tiktoken: To accurately count tokens and manage context window limits.
Let’s install these modules.
pip install openai numpy faiss-cpu networkx tiktoken
Now we need to initialize the client module, which will be used to make LLM calls. Let’s do that.
import os
from openai import OpenAI
API_KEY = "YOUR_LLM_API_KEY"
BASE_URL = "https://api.studio.nebius.com/v1/"
client = OpenAI(
base_url=BASE_URL,
api_key=API_KEY
)
print("OpenAI client configured successfully.")
We will be using open-source models through an API provider such as Nebius or Together AI. Next, we need to import and decide which open-source LLM will be used to create our AI agent.
import tiktoken
import time
GENERATION_MODEL = "meta-llama/Meta-Llama-3.1-8B-Instruct"
EMBEDDING_MODEL = "BAAI/bge-multilingual-gemma2"
For the main tasks, we are using the LLaMA 3.1 8B Instruct model. Some of the optimizations depend on an embedding model, for which we will be using the BGE Multilingual Gemma-2 embedding model.
Next, we need to define multiple helpers that will be used throughout this blog.
Creating Helper Functions
To avoid repetitive code and follow good coding practices, we will define three helper functions that will be used throughout this guide:
generate_text: Generates content based on the system and user prompts passed to the LLM.
generate_embedding: Generates embeddings for retrieval-based approaches.
count_tokens: Counts the number of tokens in a piece of text, so we can track how large each prompt gets.
Let’s start by coding the first function, generate_text, which will generate text based on the given input prompt.
def generate_text(system_prompt: str, user_prompt: str) -> str:
"""
Calls the LLM API to generate a text response.
Args:
system_prompt: The instruction that defines the AI's role and behavior.
user_prompt: The user's input to which the AI should respond.
Returns:
The generated text content from the AI, or an error message.
"""
    try:
        response = client.chat.completions.create(
            model=GENERATION_MODEL,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ]
        )
        return response.choices[0].message.content
    except Exception as e:
        # The docstring promises an error message instead of an exception, so surface API failures gracefully.
        return f"Error generating text: {e}"
Our generate_text function takes two inputs: a system prompt and a user prompt. Based on our text generation model, LLaMA 3.1 8B, it generates a response using the client module.
Next, let's code the generate_embedding function. We have chosen the BGE Multilingual Gemma-2 model for this purpose, and we will use the same client module to generate embeddings.
def generate_embedding(text: str) -> list[float]:
"""
Generates a numerical embedding for a given text string using the embedding model.
Args:
text: The input string to be converted into an embedding.
Returns:
A list of floats representing the embedding vector, or an empty list on error.
"""
    try:
        response = client.embeddings.create(
            model=EMBEDDING_MODEL,
            input=text
        )
        return response.data[0].embedding
    except Exception as e:
        # Returning an empty list on failure lets callers check `if embedding:` before using it.
        print(f"Error generating embedding: {e}")
        return []
Our embedding function returns the embedding of the given input text using the selected Gemma-2 model.
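Before moving on, it can be worth sanity-checking the embedding dimension, since the retrieval-based strategies later need it when creating a vector index. A quick check (3584 is the dimension this particular model is expected to return):
sample_embedding = generate_embedding("Hello, world!")
print(f"Embedding dimension: {len(sample_embedding)}")  # expected: 3584 for BAAI/bge-multilingual-gemma2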
Now, we need one more function that will count tokens based on the entire AI and user chat history. This helps us understand the overall flow and how it has been optimized.
We will use OpenAI's cl100k_base tokenizer, a widely used Byte Pair Encoding (BPE) tokenizer. It is not the exact tokenizer LLaMA 3.1 uses, but it gives a close enough estimate for comparing memory strategies.
BPE, in simpler terms, is a tokenization algorithm that efficiently splits text into sub-word units.
"lower", "lowest" → ["low", "er"], ["low", "est"]
So let’s initialize the tokenizer using the tiktoken module:
tokenizer = tiktoken.get_encoding("cl100k_base")
We can now create a function to tokenize the text and count the total number of tokens.
def count_tokens(text: str) -> int:
"""
Counts the number of tokens in a given string using the pre-loaded tokenizer.
Args:
text: The string to be tokenized.
Returns:
The integer count of tokens.
"""
return len(tokenizer.encode(text))
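As a quick sanity check of the helper (token counts are estimates and depend on the tokenizer, so your exact numbers may differ):
print(count_tokens("Hi there! My name is Sam."))       # a short greeting is only a handful of tokens
print(count_tokens("Hi there! My name is Sam." * 10))  # repeating the text grows the count roughly linearly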
Great! Now that we have created all the helper functions, we can start exploring different techniques to learn and evaluate them.
Creating Foundational Agent and Memory Class
Now we need to create the core design structure of our agent so that it can be used throughout the guide. Regarding memory, there are three important components that play a key role in any AI agent:
Adding past messages to the AI agent’s memory to make the agent aware of the context.
Retrieving relevant content that helps the AI agent generate responses.
Clearing the AI agent’s memory after each strategy has been implemented.
Object-Oriented Programming (OOP) is a natural fit for this, so let's define an abstract base class that every memory strategy will implement.
import abc

class BaseMemoryStrategy(abc.ABC):
"""Abstract Base Class for all memory strategies."""
@abc.abstractmethod
def add_message(self, user_input: str, ai_response: str):
"""
An abstract method that must be implemented by subclasses.
It's responsible for adding a new user-AI interaction to the memory store.
"""
pass
@abc.abstractmethod
def get_context(self, query: str) -> str:
"""
An abstract method that must be implemented by subclasses.
It retrieves and formats the relevant context from memory to be sent to the LLM.
The 'query' parameter allows some strategies (like retrieval) to fetch context
that is specifically relevant to the user's latest input.
"""
pass
@abc.abstractmethod
def clear(self):
"""
An abstract method that must be implemented by subclasses.
It provides a way to reset the memory, which is useful for starting new conversations.
"""
pass
We use @abstractmethod because every memory strategy shares the same interface but implements it differently. Each strategy subclass must provide its own add_message, get_context, and clear methods, and the abstract base class enforces that contract.
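As a quick illustration of that contract (a minimal sketch, not part of the notebook): a strategy that forgets to implement one of the three methods cannot even be instantiated.
class IncompleteMemory(BaseMemoryStrategy):
    def add_message(self, user_input: str, ai_response: str):
        pass
    def get_context(self, query: str) -> str:
        return ""
    # clear() is deliberately missing

try:
    IncompleteMemory()
except TypeError as e:
    print(f"Cannot instantiate: {e}")  # abc refuses to create an object with unimplemented abstract methods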
Now, based on the memory state we recently defined and the helper functions we’ve created, we can build our AI agent structure using OOP principles. Let’s code that and then understand the process.
class AIAgent:
"""The main AI Agent class, designed to work with any memory strategy."""
def __init__(self, memory_strategy: BaseMemoryStrategy, system_prompt: str = "You are a helpful AI assistant."):
"""
Initializes the agent.
Args:
memory_strategy: An instance of a class that inherits from BaseMemoryStrategy.
This determines how the agent will remember the conversation.
system_prompt: The initial instruction given to the LLM to define its persona and task.
"""
self.memory = memory_strategy
self.system_prompt = system_prompt
print(f"Agent initialized with {type(memory_strategy).__name__}.")
def chat(self, user_input: str):
"""
Handles a single turn of the conversation.
Args:
user_input: The latest message from the user.
"""
print(f"\n{'='*25} NEW INTERACTION {'='*25}")
print(f"User > {user_input}")
start_time = time.time()
context = self.memory.get_context(query=user_input)
retrieval_time = time.time() - start_time
full_user_prompt = f"### MEMORY CONTEXT\n{context}\n\n### CURRENT REQUEST\n{user_input}"
prompt_tokens = count_tokens(self.system_prompt + full_user_prompt)
print("\n--- Agent Debug Info ---")
print(f"Memory Retrieval Time: {retrieval_time:.4f} seconds")
print(f"Estimated Prompt Tokens: {prompt_tokens}")
print(f"\n[Full Prompt Sent to LLM]:\n---\nSYSTEM: {self.system_prompt}\nUSER: {full_user_prompt}\n---")
start_time = time.time()
ai_response = generate_text(self.system_prompt, full_user_prompt)
generation_time = time.time() - start_time
self.memory.add_message(user_input, ai_response)
print(f"\nAgent > {ai_response}")
print(f"(LLM Generation Time: {generation_time:.4f} seconds)")
print(f"{'='*70}")
So, our agent performs six simple steps on every turn.
First, it retrieves context from memory using whichever strategy is active, and measures how long that retrieval takes.
Then it merges the retrieved memory context with the current user input, preparing it as a complete prompt for the LLM.
Then it prints some debug info, things like how many tokens the prompt might use and how long context retrieval took.
Then it sends the full prompt (system + user + context) to the LLM and waits for a response.
Then it updates the memory with this new interaction, so it’s available for future context.
And finally, it shows the AI’s response along with how long it took to generate, wrapping up this turn of the conversation.
Great! Now that we have coded every component, we can start understanding and implementing each of the memory optimization techniques.
Problem with Sequential Optimization Approach
The very first optimization approach is the most basic and simplest, commonly used by many developers. It was one of the earliest methods to manage conversation history, often used by early chatbots.
This method involves adding each new message to a running log and feeding the entire conversation back to the model every time. It creates a linear chain of memory, preserving everything that has been said so far. Let’s visualize it.
Sequential Approach
The sequential approach works like this:
User starts a conversation with the AI agent.
The agent responds.
This user-AI interaction (a “turn”) is saved as a single block of text.
For the next turn, the agent takes the entire history (Turn 1 + Turn 2 + Turn 3…) and combines it with the new user query.
This massive block of text is sent to the LLM to generate the next response.
Using our Memory class, we can now implement the sequential optimization approach. Let's code that.
class SequentialMemory(BaseMemoryStrategy):
def __init__(self):
"""Initializes the memory with an empty list to store conversation history."""
self.history = []
def add_message(self, user_input: str, ai_response: str):
"""
Adds a new user-AI interaction to the history.
Each interaction is stored as two dictionary entries in the list.
"""
self.history.append({"role": "user", "content": user_input})
self.history.append({"role": "assistant", "content": ai_response})
def get_context(self, query: str) -> str:
"""
Retrieves the entire conversation history and formats it into a single
string to be used as context for the LLM. The 'query' parameter is ignored
as this strategy always returns the full history.
"""
return "\n".join([f"{turn['role'].capitalize()}: {turn['content']}" for turn in self.history])
def clear(self):
"""Resets the conversation history by clearing the list."""
self.history = []
print("Sequential memory cleared.")
Now you can see what the base memory class is for: every strategy we define throughout the guide is a subclass that implements the same abstract methods in its own way.
Let’s quickly go over the code to understand how it works.
__init__(self): Initializes an empty self.history list to store the conversation.
add_message(...): Adds the user's input and AI's response to the history.
get_context(...): Formats and joins the history into a single "Role: Content" string as context.
clear(): Resets the history for a new conversation.
We can initialize the memory class and build the AI agent on top of it.
sequential_memory = SequentialMemory()
agent = AIAgent(memory_strategy=sequential_memory)
To test our sequential approach, we need to create a multi-turn chat conversation. Let’s do that.
agent.chat("Hi there! My name is Sam.")
agent.chat("I'm interested in learning about space exploration.")
agent.chat("What was the first thing I told you?")
==== NEW INTERACTION ====
User: Hi there! My name is Sam.
Bot: Hello Sam! Nice to meet you. What brings you here today?
>>>> Tokens: 23 | Response Time: 2.25s
==== NEW INTERACTION ====
User: I am interested in learning about space exploration.
Bot: Awesome! Are you curious about:
- Mars missions
- Space agencies
- Private companies (e.g., SpaceX)
- Space tourism
- Search for alien life?
...
>>>> Tokens: 92 | Response Time: 4.46s
==== NEW INTERACTION ====
User: What was the first thing I told you?
Bot: You said, "Hi there! My name is Sam."
...
>>>> Tokens: 378 | Response Time: 0.52s
The conversation is pretty smooth, but if you pay attention to the token calculation, you’ll notice that it gets bigger and bigger after each turn. Our agent isn’t dependent on any external tool that would significantly increase the token size, so this growth is purely due to the sequential accumulation of messages.
While the sequential approach is easy to implement, it has a major drawback:
The longer the conversation gets, the more tokens every single request carries, and since each turn re-sends the whole history, the total token cost grows roughly quadratically with the number of turns. A sequential approach quickly becomes costly.
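To see this growth concretely, you can measure how the prompt size climbs turn by turn. A small sketch using the classes and helpers defined above (the dummy replies stand in for real LLM output, so the absolute numbers are only illustrative):
memory = SequentialMemory()
for turn in range(1, 6):
    user_msg = f"This is user message number {turn}, asking about something new."
    ai_msg = f"This is a fairly detailed assistant reply to message number {turn}."
    memory.add_message(user_msg, ai_msg)
    print(f"Turn {turn}: context tokens = {count_tokens(memory.get_context(''))}")
# The per-turn context grows linearly, so the total tokens sent over N turns grow roughly quadratically.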
Sliding Window Approach
To avoid the issue of a large context, the next approach we will focus on is the sliding window approach, where our agent doesn’t need to remember all previous messages, but only the context from a certain number of recent messages.
Instead of retaining the entire conversation history, the agent keeps only the most recent N messages as context. As new messages arrive, the oldest ones are dropped, and the window slides forward.
Sliding Window Approach
The process is simple:
Define a fixed window size, say N = 2 turns.
The first two turns fill up the window.
When the third turn happens, the very first turn is pushed out of the window to make space.
The context sent to the LLM is only what’s currently inside the window.
Now, we can implement the Sliding Window Memory class.
from collections import deque

class SlidingWindowMemory(BaseMemoryStrategy):
def __init__(self, window_size: int = 4):
"""
Initializes the memory with a deque of a fixed size.
Args:
window_size: The number of conversational turns to keep in memory.
A single turn consists of one user message and one AI response.
"""
self.history = deque(maxlen=window_size)
def add_message(self, user_input: str, ai_response: str):
"""
Adds a new conversational turn to the history. If the deque is full,
the oldest turn is automatically removed.
"""
self.history.append([
{"role": "user", "content": user_input},
{"role": "assistant", "content": ai_response}
])
def get_context(self, query: str) -> str:
"""
Retrieves the conversation history currently within the window and
formats it into a single string. The 'query' parameter is ignored.
"""
context_list = []
for turn in self.history:
for message in turn:
context_list.append(f"{message['role'].capitalize()}: {message['content']}")
return "\n".join(context_list)
Our sequential and sliding memory classes are quite similar. The key difference is that we’re adding a window to our context. Let’s quickly go through the code.
__init__(self, window_size): Sets up a deque with maxlen=window_size, so the context window slides automatically as new turns arrive.
add_message(...): Adds a new turn, old entries are dropped when the deque is full.
get_context(...): Builds the context from only the messages within the current sliding window.
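The automatic eviction comes straight from deque(maxlen=...); a tiny illustration of that behaviour on its own:
from collections import deque

window = deque(maxlen=2)
window.append("turn 1")
window.append("turn 2")
window.append("turn 3")   # "turn 1" is silently dropped once the deque is full
print(list(window))       # ['turn 2', 'turn 3']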
Let’s initialize the sliding window state memory and build the AI agent on top of it.
sliding_memory = SlidingWindowMemory(window_size=2)
agent = AIAgent(memory_strategy=sliding_memory)
We are using a small window size of 2, which means the agent will remember only the last two conversational turns. To test this optimization approach, we need a multi-turn conversation. So, let's first try a straightforward conversation.
agent.chat("My name is Priya and I'm a software developer.")
agent.chat("I work primarily with Python and cloud technologies.")
agent.chat("My favorite hobby is hiking.")
==== NEW INTERACTION ====
User: My name is Priya and I am a software developer.
Bot: Nice to meet you, Priya! What can I assist you with today?
>>>> Tokens: 27 | Response Time: 1.10s
==== NEW INTERACTION ====
User: I work primarily with Python and cloud technologies.
Bot: That is great! Given your expertise...
>>>> Tokens: 81 | Response Time: 1.40s
==== NEW INTERACTION ====
User: My favorite hobby is hiking.
Bot: It seems we had a nice conversation about your background...
>>>> Tokens: 167 | Response Time: 1.59s
The conversation flows just as smoothly as in the sequential approach. But what happens if the user asks about something that has already slid out of the window? Let's observe the output.
agent.chat("What is my name?")
==== NEW INTERACTION ====
User: What is my name?
Bot: I apologize, but I dont have access to your name from our recent
conversation. Could you please remind me?
>>>> Tokens: 197 | Response Time: 0.60s
The AI agent couldn’t answer the question because the relevant context was outside the sliding window. However, we did see a reduction in token count due to this optimization.
The downside is clear: important context may be lost if the user refers back to earlier information. The window size is a crucial parameter and should be tuned to the specific type of AI agent we are building.
Summarization Based Optimization
As we’ve seen earlier, the sequential approach suffers from a gigantic context issue, while the sliding window approach risks losing important context.
Therefore, there’s a need for an approach that can address both problems, by compacting the context without losing essential information. This can be achieved through summarization.
Summarization Approach
Instead of simply dropping old messages, this strategy periodically uses the LLM itself to create a running summary of the conversation. It works like this:
Recent messages are stored in a temporary holding area, called a “buffer”.
Once this buffer reaches a certain size (a “threshold”), the agent pauses and triggers a special action.
It sends the contents of the buffer, along with the previous summary, to the LLM with a specific instruction: “Create a new, updated summary that incorporates these recent messages”.
The LLM generates a new, consolidated summary. This new summary replaces the old one, and the buffer is cleared.
Let’s implement the summarization optimization approach and observe how it affects the agent’s performance.
class SummarizationMemory(BaseMemoryStrategy):
def __init__(self, summary_threshold: int = 4):
"""
Initializes the summarization memory.
Args:
summary_threshold: The number of messages (user + AI) to accumulate in the
buffer before triggering a summarization.
"""
self.running_summary = ""
self.buffer = []
self.summary_threshold = summary_threshold
def add_message(self, user_input: str, ai_response: str):
"""
Adds a new user-AI interaction to the buffer. If the buffer size
reaches the threshold, it triggers the memory consolidation process.
"""
self.buffer.append({"role": "user", "content": user_input})
self.buffer.append({"role": "assistant", "content": ai_response})
if len(self.buffer) >= self.summary_threshold:
self._consolidate_memory()
def _consolidate_memory(self):
"""
Uses the LLM to summarize the contents of the buffer and merge it
with the existing running summary.
"""
print("\n--- [Memory Consolidation Triggered] ---")
buffer_text = "\n".join([f"{msg['role'].capitalize()}: {msg['content']}" for msg in self.buffer])
summarization_prompt = (
f"You are a summarization expert. Your task is to create a concise summary of a conversation. "
f"Combine the 'Previous Summary' with the 'New Conversation' into a single, updated summary. "
f"Capture all key facts, names, and decisions.\n\n"
f"### Previous Summary:\n{self.running_summary}\n\n"
f"### New Conversation:\n{buffer_text}\n\n"
f"### Updated Summary:"
)
new_summary = generate_text("You are an expert summarization engine.", summarization_prompt)
self.running_summary = new_summary
self.buffer = []
print(f"--- [New Summary: '{self.running_summary}'] ---")
def get_context(self, query: str) -> str:
"""
Constructs the context to be sent to the LLM. It combines the long-term
running summary with the short-term buffer of recent messages.
The 'query' parameter is ignored as this strategy provides a general context.
"""
buffer_text = "\n".join([f"{msg['role'].capitalize()}: {msg['content']}" for msg in self.buffer])
return f"### Summary of Past Conversation:\n{self.running_summary}\n\n### Recent Messages:\n{buffer_text}"
Our summarization memory component is a bit different compared to the previous approaches. Let’s break down and understand the component we’ve just coded.
__init__(...): Sets up an empty running_summary string and an empty buffer list.
add_message(...): Adds messages to the buffer. If the buffer size meets our summary_threshold, it calls the private _consolidate_memory method.
_consolidate_memory(): This is the new, important part. It formats the buffer content and the existing summary into a special prompt, asks the LLM to create a new summary, updates self.running_summary, and clears the buffer.
get_context(...): Provides the LLM with both the long-term summary and the short-term buffer, giving it a complete picture of the conversation.
Let’s initialize the summary memory component and build the AI agent on top of it.
summarization_memory = SummarizationMemory(summary_threshold=4)
agent = AIAgent(memory_strategy=summarization_memory)
The initialization is done in the same way as we saw earlier. We’ve set the summary threshold to 4, which means after every 2 turns, a summary will be generated and passed as context to the AI agent, instead of the entire or sliding window conversation history.
This aligns with the core goal of the summarization approach, saving tokens while retaining important information.
Let’s test this approach and evaluate how efficient it is in terms of token usage and preserving relevant context.
agent.chat("I'm starting a new company called 'Innovatech'. Our focus is on sustainable energy.")
agent.chat("Our first product will be a smart solar panel, codenamed 'Project Helios'.")
==== NEW INTERACTION ====
User: I am starting a new company called 'Innovatech'. Ou...
Bot: Congratulations on starting Innovatech! Focusing o ...
>>>> Tokens: 45 | Response Time: 2.55s
==== NEW INTERACTION ====
User: Our first product will be a smart solar panel....
--- [Memory Consolidation Triggered] ---
--- [New Summary: The user started a compan ...
Bot: That is exciting news about ....
>>>> Tokens: 204 | Response Time: 3.58s
So far, we've had two basic conversation turns. Since we set the summary threshold to 4 messages (two turns), a summary has now been generated for those previous turns.
Let’s proceed with the next turn and observe the impact on token usage.
agent.chat("The marketing budget is set at $50,000.")
agent.chat("What is the name of my company and its first product?")
...
==== NEW INTERACTION ====
User: What is the name of my company and its first product?
Bot: Your company is called 'Innovatech' and its first product is codenamed 'Project Helios'.
>>>> Tokens: 147 | Response Time: 1.05s
Did you notice that in our fourth turn, the token count dropped to nearly half of what we saw in the sequential and sliding window approaches? That's the biggest advantage of the summarization approach: it greatly reduces token usage.
However, for it to be truly effective, your summarization prompts need to be carefully crafted to ensure they capture the most important details.
The main downside is that critical information can still be lost in the summarization process. For example, if you continue a conversation for up to 40 turns and include numeric or factual details, such as balance sheet data, there’s a risk that earlier key info (like the gross sales mentioned in the 4th turn) may not appear in the summary anymore.
Let's look at an example where the user had a 40-turn conversation with the AI agent that included several numeric details.
The summary used as context failed to include the gross sales figure from the 4th turn, which is a clear limitation of this approach.
agent.chat("what was the gross sales of our company in the fiscal year?")
...
==== NEW INTERACTION ====
User: what was the gross sales of our company in the fiscal year?
Bot: I am sorry but I do not have that information. Could you please provide the gross sales figure for the fiscal year?
>>>> Tokens: 1532 | Response Time: 2.831s
You can see that although the summarized context uses fewer tokens, answer quality and accuracy can degrade significantly, or fail outright, when key facts are missing from the context passed to the AI agent.
This highlights the importance of creating a sub-agent dedicated to fact-checking the LLM’s responses. Such a sub-agent can verify factual accuracy and help make the overall agent more reliable and powerful.
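A minimal sketch of such a fact-checking sub-agent, reusing the generate_text helper from above (the prompt wording and the 'SUPPORTED'/'UNSUPPORTED' labels are one possible design, not something from the original notebook):
def fact_check_response(context: str, answer: str) -> str:
    """Asks the LLM whether an answer is actually supported by the memory context."""
    check_prompt = (
        "You are a strict fact checker. Given the MEMORY CONTEXT and the ANSWER, "
        "reply with 'SUPPORTED' if every factual claim in the answer appears in the context, "
        "otherwise reply with 'UNSUPPORTED' and list the unsupported claims.\n\n"
        f"### MEMORY CONTEXT:\n{context}\n\n### ANSWER:\n{answer}"
    )
    return generate_text("You are a fact-checking engine.", check_prompt)

# Example usage (hypothetical values):
# verdict = fact_check_response(summarization_memory.get_context(""), ai_response)
# print(verdict)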
Retrieval Based Memory
This is the most powerful strategy used in many AI agent use cases: RAG-based AI agents. As we saw earlier, the previous approaches reduce token usage but risk losing relevant context. RAG is different: it retrieves only the context that is relevant to the current user query.
The context is stored in a database, where embedding models play a crucial role by transforming text into vector representations that make retrieval efficient.
Let’s visualize how this process works.
RAG Based Memory
Let’s understand the workflow of RAG-based memory:
Every time a new interaction happens, it’s not just stored in a list, it’s saved as a “document” in a specialized database. We also generate a numerical representation of this document’s meaning, called an embedding, and store it.
When the user sends a new message, the agent first converts this new message into an embedding as well.
It then uses this query embedding to perform a similarity search against all the document embeddings stored in its memory database.
The system retrieves the top k most semantically relevant documents (e.g., the 3 most similar past conversation turns).
Finally, only these highly relevant, retrieved documents are injected into the LLM’s context window.
We will be using FAISS for vector storage in this approach. Let’s code this memory component.
import numpy as np
import faiss
class RetrievalMemory(BaseMemoryStrategy):
def __init__(self, k: int = 2, embedding_dim: int = 3584):
"""
Initializes the retrieval memory system.
Args:
k: The number of top relevant documents to retrieve for a given query.
embedding_dim: The dimension of the vectors generated by the embedding model.
For BAAI/bge-multilingual-gemma2, this is 3584.
"""
self.k = k
self.embedding_dim = embedding_dim
self.documents = []
self.index = faiss.IndexFlatL2(self.embedding_dim)
def add_message(self, user_input: str, ai_response: str):
"""
Adds a new conversational turn to the memory. Each part of the turn (user
input and AI response) is embedded and indexed separately for granular retrieval.
"""
docs_to_add = [
f"User said: {user_input}",
f"AI responded: {ai_response}"
]
for doc in docs_to_add:
embedding = generate_embedding(doc)
if embedding:
self.documents.append(doc)
vector = np.array([embedding], dtype='float32')
self.index.add(vector)
def get_context(self, query: str) -> str:
"""
Finds the k most relevant documents from memory based on semantic
similarity to the user's query.
"""
if self.index.ntotal == 0:
return "No information in memory yet."
query_embedding = generate_embedding(query)
if not query_embedding:
return "Could not process query for retrieval."
query_vector = np.array([query_embedding], dtype='float32')
distances, indices = self.index.search(query_vector, self.k)
        retrieved_docs = [self.documents[i] for i in indices[0] if i != -1]
if not retrieved_docs:
return "Could not find any relevant information in memory."
return "### Relevant Information Retrieved from Memory:\n" + "\n---\n".join(retrieved_docs)
Let’s go through what’s happening in the code.
__init__(...): We initialize a list for our text documents and a faiss.IndexFlatL2 to store and search our vectors. We must specify the embedding_dim, which is the size of the vectors our embedding model produces.
add_message(...): For each turn, we generate an embedding for both the user and AI messages, add the text to our documents list, and add the corresponding vector to our FAISS index.
get_context(...): This is important. It embeds the user's query, uses self.index.search to find the k most similar vectors, and then uses their indices to pull the original text from our documents list. This retrieved text becomes the context.
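If the mechanics of index.search feel opaque, here is a tiny standalone FAISS example with 4-dimensional toy vectors (the dimension and values are made up purely for illustration, not real embeddings):
import numpy as np
import faiss

dim = 4
index = faiss.IndexFlatL2(dim)
docs = ["doc about Japan trip", "doc about React frontend", "doc about Kyoto temples"]
vectors = np.array([
    [0.9, 0.1, 0.0, 0.0],   # pretend embedding for doc 0
    [0.0, 0.0, 0.9, 0.1],   # pretend embedding for doc 1
    [0.8, 0.2, 0.0, 0.1],   # pretend embedding for doc 2
], dtype='float32')
index.add(vectors)

query = np.array([[0.85, 0.15, 0.0, 0.05]], dtype='float32')  # a query "about Japan"
distances, indices = index.search(query, 2)
print([docs[i] for i in indices[0]])   # the two closest documents (0 and 2)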
As before, we initialize our memory state and build the AI agent using it.
retrieval_memory = RetrievalMemory(k=2)
agent = AIAgent(memory_strategy=retrieval_memory)
We are setting k = 2, which means we fetch only the two chunks most relevant to the user's query. When dealing with larger datasets, we typically set k to a higher value such as 5 or 7, or even more, especially if the chunk size is very small.
Let's test our AI agent with this setup.
agent.chat("I am planning a vacation to Japan for next spring.")
agent.chat("For my software project, I'm using the React framework for the frontend.")
agent.chat("I want to visit Tokyo and Kyoto while I'm on my trip.")
agent.chat("The backend of my project will be built with Django.")
...
==== NEW INTERACTION ====
User: I want to visit Tokyo and Kyoto while I'm on my trip.
Bot: You're interested in visiting Tokyo and Kyoto...
...
These are just basic conversations that we typically run with an AI agent. Now, let's ask a question that depends on earlier information and see how well the relevant context is retrieved and how optimized the token usage is in that scenario.
agent.chat("What cities am I planning to visit on my vacation?")
==== NEW INTERACTION ====
User: What cities am I planning to visit on my vacation?
--- Agent Debug Info ---
[Full Prompt Sent to LLM]:
---
SYSTEM: You are a helpful AI assistant.
USER: MEMORY CONTEXT
Relevant Information Retrieved from Memory:
User said: I want to visit Tokyo and Kyoto while I am on my trip.
---
User said: I am planning a vacation to Japan for next spring.
...
Bot: You are planning to visit Tokyo and Kyoto while on your vacation to Japan next spring.
>>>> Tokens: 65 | Response Time: 0.53s
You can see that the relevant context has been successfully fetched, and the token count is extremely low because we’re retrieving only the pertinent information.
The choice of embedding model and the vector storage database plays a crucial role here. Optimizing that database is another important step to ensure fast and accurate retrieval. FAISS is a popular choice because it offers these capabilities.
However, the downside is that this approach is more complex to implement than it seems. As the database grows larger, the AI agent’s complexity increases significantly.
You’ll likely need parallel query processing and other optimization techniques to maintain performance. Despite these challenges, this approach remains the industry standard for optimizing AI agents.
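For example, once the memory holds many thousands of vectors, the brute-force IndexFlatL2 can be swapped for an approximate index. A rough sketch using FAISS's IVF index (the nlist value and the random training data are placeholders you would tune and replace with real embeddings):
import numpy as np
import faiss

embedding_dim = 3584
nlist = 64                                    # number of clusters; tune for your data size
quantizer = faiss.IndexFlatL2(embedding_dim)
ivf_index = faiss.IndexIVFFlat(quantizer, embedding_dim, nlist)

# IVF indexes must be trained on a representative sample of vectors before use.
training_vectors = np.random.rand(1000, embedding_dim).astype('float32')
ivf_index.train(training_vectors)
ivf_index.add(training_vectors)

query = np.random.rand(1, embedding_dim).astype('float32')
distances, indices = ivf_index.search(query, 2)
print(indices[0])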
Memory Augmented Transformers
Beyond these core strategies, AI systems are implementing even more sophisticated approaches that push the boundaries of what’s possible.
We can understand this technique through an example: imagine a regular AI as a student with just one small notepad. They can only write a little bit at a time. So in a long test, they have to erase old notes to make room for new ones.
Now, memory-augmented transformers are like giving that student a bunch of sticky notes. The notepad still handles the current work, but the sticky notes help them save key info from earlier.
For example: you’re designing a video game with an AI. Early on, you say you want it to be set in space with no violence. Normally, that would get forgotten after a long talk. But with memory, the AI writes “space setting, no violence” on a sticky note.
Later, when you ask, “What characters would fit our game?”, it checks the note and gives ideas that match your original vision, even hours later.
It’s like having a smart helper who remembers the important stuff without needing you to repeat it.
Let’s visualize this:
Memory Augmented Transformers
We will create a memory class that:
Uses a SlidingWindowMemory for recent chat.
After each turn, uses the LLM to act as a “fact extractor.” It will analyze the conversation and decide if it contains a core fact, preference, or decision.
If an important fact is found, it’s stored as a memory token (a concise string) in a separate list.
The final context provided to the agent is a combination of the recent chat window and all the persistent memory tokens.
class MemoryAugmentedMemory(BaseMemoryStrategy):
def __init__(self, window_size: int = 2):
"""
Initializes the memory-augmented system.
Args:
window_size: The number of recent turns to keep in the short-term memory.
"
""
self.recent_memory = SlidingWindowMemory(window_size=window_size)
self.memory_tokens = []
def add_message(self, user_input: str, ai_response: str):
"""
Adds the latest turn to recent memory and then uses an LLM call to decide
if a new, persistent memory token should be created from this interaction.
"""
self.recent_memory.add_message(user_input, ai_response)
fact_extraction_prompt = (
f"Analyze the following conversation turn. Does it contain a core fact, preference, or decision that should be remembered long-term? "
f"Examples include user preferences ('I hate flying'), key decisions ('The budget is $1000'), or important facts ('My user ID is 12345').\n\n"
f"Conversation Turn:\nUser: {user_input}\nAI: {ai_response}\n\n"
f"If it contains such a fact, state the fact concisely in one sentence. Otherwise, respond with 'No important fact.'"
)
extracted_fact = generate_text("You are a fact-extraction expert.", fact_extraction_prompt)
if "no important fact" not in extracted_fact.lower():
print(f"--- [Memory Augmentation: New memory token created: '{extracted_fact}'] ---")
self.memory_tokens.append(extracted_fact)
def get_context(self, query: str) -> str:
"""
Constructs the context by combining the short-term recent conversation
with the list of all long-term, persistent memory tokens.
"""
recent_context = self.recent_memory.get_context(query)
memory_token_context = "\n".join([f"- {token}" for token in self.memory_tokens])
return f"### Key Memory Tokens (Long-Term Facts):\n{memory_token_context}\n\n### Recent Conversation:\n{recent_context}"
Our augmented class might be confusing at first glance, but let’s understand this:
__init__(...): Initializes both a SlidingWindowMemory instance and an empty list for memory_tokens.
add_message(...): This method now has two jobs. It adds the turn to the sliding window and makes an extra LLM call to see if a key fact should be extracted and added to self.memory_tokens.
get_context(...): Constructs a rich prompt by combining the "sticky notes" (memory_tokens) with the recent chat history from the sliding window.
Let’s initialize this memory-augmented state and AI agent.
mem_aug_memory = MemoryAugmentedMemory(window_size=2)
agent = AIAgent(memory_strategy=mem_aug_memory)
We are using a window size of 2, just as we set previously. Now, we can simply test this approach using a multi-turn chat conversation and see how well it performs.
agent.chat("Please remember this for all future interactions: I am severely allergic to peanuts.")
agent.chat("Okay, let's talk about recipes. What's a good idea for dinner tonight?")
agent.chat("That sounds good. What about a dessert option?")
==== NEW INTERACTION ====
User: Please remember this for all future interactions: I am severely allergic to peanuts.
--- [Memory Augmentation: New memory token created: 'The user has a severe allergy to peanuts.'] ---
Bot: I have taken note of your long-term fact: You are severely allergic to peanuts. I will keep this in mind...
>>>> Tokens: 45 | Response Time: 1.32s
...
The conversation is the same as with an ordinary AI agent. Now, let’s test the memory-augmented technique by including a new method.
agent.chat("Could you suggest a Thai green curry recipe? Please ensure it's safe for me.")
==== NEW INTERACTION ====
User: Could you suggest a Thai green curry recipe? Please ensure it is safe for me.
--- Agent Debug Info ---
[Full Prompt Sent to LLM]:
---
SYSTEM: You are a helpful AI assistant.
USER: MEMORY CONTEXT
Key Memory Tokens (Long-Term Facts):
- The user has a severe allergy to peanuts.
...
Recent Conversation:
User: Okay, lets talk about recipes...
...
Bot: Of course. Given your peanut allergy, it is very important to be careful with Thai cuisine as many recipes use peanuts or peanut oil. Here is a peanut-free Thai green curry recipe...
>>>> Tokens: 712 | Response Time: 6.45s
This approach really shows its value on longer conversations and larger datasets, where durable facts (like the allergy above) must survive many turns; in those settings it can be a better option than the simpler strategies.
It is a more complex and expensive strategy due to the extra LLM calls for fact extraction, but its ability to retain critical information over long, evolving conversations makes it incredibly powerful for building truly reliable and intelligent personal assistants.
Hierarchical Optimization for Multi-tasks
So far, we have treated memory as a single system. But what if we could build an agent that thinks more like a human, with different types of memory for different purposes?
This is the idea behind Hierarchical Memory. It’s a composite strategy that combines multiple, simpler memory types into a layered system, creating a more sophisticated and organized mind for our agent.
Think about how you remember things:
Working Memory: The last few sentences someone said to you. It’s fast, but fleeting.
Short-Term Memory: The main points from a meeting you had this morning. You can recall them easily for a few hours.
Long-Term Memory: Your home address or a critical fact you learned years ago. It’s durable and deeply ingrained.
Hierarchical Optimization
The hierarchical approach works like this:
It starts with capturing the user message into working memory.
Then it checks if the information is important enough to promote to long-term memory.
After that, promoted content is stored in a retrieval memory for future use.
On new queries, it searches long-term memory for relevant context.
Finally, it injects relevant memories into context to generate better responses.
Let’s build this component.
class HierarchicalMemory(BaseMemoryStrategy):
def __init__(self, window_size: int = 2, k: int = 2, embedding_dim: int = 3584):
"""
Initializes the hierarchical memory system.
Args:
window_size: The size of the short-term working memory (in turns).
k: The number of documents to retrieve from long-term memory.
embedding_dim: The dimension of the embedding vectors for long-term memory.
"""
print("Initializing Hierarchical Memory...")
self.working_memory = SlidingWindowMemory(window_size=window_size)
self.long_term_memory = RetrievalMemory(k=k, embedding_dim=embedding_dim)
self.promotion_keywords = ["remember", "rule", "preference", "always", "never", "allergic"]
def add_message(self, user_input: str, ai_response: str):
"""
Adds a message to working memory and conditionally promotes it to long-term
memory based on its content.
"""
self.working_memory.add_message(user_input, ai_response)
if any(keyword in user_input.lower() for keyword in self.promotion_keywords):
print(f"--- [Hierarchical Memory: Promoting message to long-term storage.] ---")
self.long_term_memory.add_message(user_input, ai_response)
def get_context(self, query: str) -> str:
"""
Constructs a rich context by combining relevant information from both
the long-term and short-term memory layers.
"""
working_context = self.working_memory.get_context(query)
long_term_context = self.long_term_memory.get_context(query)
return f"### Retrieved Long-Term Memories:\n{long_term_context}\n\n### Recent Conversation (Working Memory):\n{working_context}"
So, to walk through the code:
__init__(...): Initializes an instance of SlidingWindowMemory and an instance of RetrievalMemory. It also defines a list of promotion_keywords.
add_message(...): Adds every message to the short-term working_memory. It then checks if the user_input contains any of the special keywords. If it does, the message is also added to the long_term_memory.
get_context(...): This is where the hierarchy comes together. It fetches context from both memory systems and combines them into one rich prompt, giving the LLM both recent conversational flow and relevant deep facts.
Let’s now initialize the memory component and AI agent.
hierarchical_memory = HierarchicalMemory()
agent = AIAgent(memory_strategy=hierarchical_memory)
We can now create a multi-turn chat conversation for this technique.
agent.chat("Please remember my User ID is AX-7890.")
agent.chat("Let's chat about the weather. It's very sunny today.")
agent.chat("I'm planning to go for a walk later.")
agent.chat("I need to log into my account, can you remind me of my ID?")
We are testing this with a scenario where the user provides an important piece of information (a User ID) using a keyword ("remember").
Then we have a few turns of unrelated chat. In the last turn, we ask the agent to recall the ID. Let's look at the output of the AI agent.
==== NEW INTERACTION ====
User: Please remember my User ID is AX-7890.
--- [Hierarchical Memory: Promoting message to long-term storage.] ---
Bot: You have provided your User ID as AX-7890, which has been stored in long-term memory for future reference.
...
==== NEW INTERACTION ====
User: I need to log into my account, can you remind me of my ID?
--- Agent Debug Info ---
[Full Prompt Sent to LLM]:
---
SYSTEM: You are a helpful AI assistant.
USER:
User said: Please remember my User ID is AX-7890.
...
User: Let's chat about the weather...
User: I'm planning to go for a walk later...
Bot: Your User ID is AX-7890. You can use this to log into your account. Is there anything else I can assist you with?
>>>> Tokens: 452 | Response Time: 2.06s
As you can see, the agent successfully combines different memory types. It uses the fast working memory for the flow of conversation but correctly queries its deep, long-term memory to retrieve the critical User ID when asked.
This hybrid approach is a powerful pattern for building sophisticated agents.
Graph Based Optimization
So far, our memory has stored information as chunks of text, whether it’s the full conversation, a summary, or a retrieved document. But what if we could teach our agent to understand the relationships between different pieces of information? This is the leap we take with Graph-Based Memory.
This strategy moves beyond storing unstructured text and represents information as a knowledge graph.
A knowledge graph consists of:
Nodes (or Entities): These are the "things" in our conversation, like people (Clara), companies (FutureScape), or concepts (Project Odyssey).
Edges (or Relations): These are the connections that describe how the nodes relate to each other, like works_for, is_based_in, or manages.
The result is a structured, web-like memory. Instead of a simple fact like "Clara works for FutureScape," the agent stores a connection: (Clara) --[works_for]--> (FutureScape).
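As a toy illustration of that structure, here is how a single triple can be stored and read back with the networkx library we'll use below (the node and relation names simply mirror the example above):
import networkx as nx

graph = nx.DiGraph()
graph.add_edge("Clara", "FutureScape", relation="works_for")   # (Clara) --[works_for]--> (FutureScape)

for subject, obj, data in graph.out_edges("Clara", data=True):
    print(f"{subject} --[{data['relation']}]--> {obj}")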
Graph Based Approach
This is incredibly powerful for answering complex queries that require reasoning about connections. The main challenge is populating the graph from unstructured conversation.
For this, we can use a powerful technique: using the LLM itself as a tool to extract structured (Subject, Relation, Object) triples from the text.
For our implementation, we’ll use the networkx library to build and manage our graph. The core of this strategy will be a helper method, _extract_triples, that calls the LLM with a specific prompt to convert conversational text into structured (Subject, Relation, Object) data.
import re
import networkx as nx

class GraphMemory(BaseMemoryStrategy):
def __init__(self):
"""Initializes the memory with an empty NetworkX directed graph."""
self.graph = nx.DiGraph()
def _extract_triples(self, text: str) -> list[tuple[str, str, str]]:
"""
Uses the LLM to extract knowledge triples (Subject, Relation, Object) from a given text.
This is a form of "LLM as a Tool" where the model's language understanding is
used to create structured data.
"""
print("--- [Graph Memory: Attempting to extract triples from text.] ---")
extraction_prompt = (
f"You are a knowledge extraction engine. Your task is to extract Subject-Relation-Object triples from the given text. "
f"Format your output strictly as a list of Python tuples. For example: [('Sam', 'works_for', 'Innovatech'), ('Innovatech', 'focuses_on', 'Energy')]. "
f"If no triples are found, return an empty list [].\n\n"
f"Text to analyze:\n\"""{text}\""""
)
response_text = generate_text("You are an expert knowledge graph extractor.", extraction_prompt)
try:
found_triples = re.findall(r"\(['\"](.*?)['\"],\s*['\"](.*?)['\"],\s*['\"](.*?)['\"]\)", response_text)
print(f"--- [Graph Memory: Extracted triples: {found_triples}] ---")
return found_triples
except Exception as e:
print(f"Could not parse triples from LLM response: {e}")
return []
def add_message(self, user_input: str, ai_response: str):
"""Extracts triples from the latest conversation turn and adds them to the knowledge graph."""
full_text = f"User: {user_input}\nAI: {ai_response}"
triples = self._extract_triples(full_text)
for subject, relation, obj in triples:
self.graph.add_edge(subject.strip(), obj.strip(), relation=relation.strip())
def get_context(self, query: str) -> str:
"""
Retrieves context by finding entities from the query in the graph and
returning all their known relationships.
"""
if not self.graph.nodes:
return "The knowledge graph is empty."
query_entities = [word.capitalize() for word in query.replace('?','').split() if word.capitalize() in self.graph.nodes]
if not query_entities:
return "No relevant entities from your query were found in the knowledge graph."
context_parts = []
for entity in set(query_entities):
for u, v, data in self.graph.out_edges(entity, data=True):
context_parts.append(f"{u} --[{data['relation']}]--> {v}")
for u, v, data in self.graph.in_edges(entity, data=True):
context_parts.append(f"{u} --[{data['relation']}]--> {v}")
return "### Facts Retrieved from Knowledge Graph:\n" + "\n".join(sorted(list(set(context_parts))))
_extract_triples(…): This is the engine of the strategy. It sends the conversation text to the LLM with a highly specific prompt, asking it to return structured data.
add_message(…): This method orchestrates the process. It calls _extract_triples on the new conversation turn and then adds the resulting subject-relation-object pairs as edges to the networkx graph.
get_context(…): This performs a simple search. It looks for entities from the user's query that exist as nodes in the graph. If it finds any, it retrieves all known relationships for those entities and provides them as structured context.
Let’s see if our agent can build a mental map of a scenario and then use it to answer a question that requires connecting the dots.
You'll see the [Graph Memory: Extracted triples] log after each turn, showing how the agent is building its knowledge base in real time.
The final context won’t be conversational text but rather a structured list of facts retrieved from the graph.
graph_memory = GraphMemory()
agent = AIAgent(memory_strategy=graph_memory)
agent.chat("A person named Clara works for a company called 'FutureScape'.")
agent.chat("FutureScape is based in Berlin.")
agent.chat("Clara's main project is named 'Odyssey'.")
agent.chat("Tell me about Clara's project.")
The output we get after this multi-turn chat is:
############ OUTPUT ############
==== NEW INTERACTION ====
User: A person named Clara works for a company called 'FutureScape'.
--- [Graph Memory: Attempting to extract triples from text.] ---
--- [Graph Memory: Extracted triples: [('Clara', 'works_for', 'FutureScape')]] ---
Bot: Understood. I've added the fact that Clara works for FutureScape to my knowledge graph.
...
==== NEW INTERACTION ====
User: Clara's main project is named 'Odyssey'.
--- [Graph Memory: Attempting to extract triples from text.] ---
--- [Graph Memory: Extracted triples: [('Clara', 'manages_project', 'Odyssey')]] ---
Bot: Got it. I've noted that Clara's main project is Odyssey.
==== NEW INTERACTION ====
User: Tell me about Clara's project.
--- Agent Debug Info ---
[Full Prompt Sent to LLM]:
---
SYSTEM: You are a helpful AI assistant.
USER: ### MEMORY CONTEXT
### Facts Retrieved from Knowledge Graph:
Clara --[manages_project]--> Odyssey
Clara --[works_for]--> FutureScape
...
Bot: Based on my knowledge graph, Clara's main project is named 'Odyssey', and Clara works for the company FutureScape.
>>>> Tokens: 78 | Response Time: 1.5s
The agent didn't just find a sentence containing "Clara" and "project"; it navigated its internal graph to present all known facts related to the entities in the query.
This opens the door to building highly knowledgeable expert agents.
Compression & Consolidation Memory
We have seen that summarization is a good way to manage long conversations, but what if we could be even more aggressive in cutting down token usage? This is where Compression & Consolidation Memory comes into play. It’s like summarization’s more intense sibling.
Instead of creating a narrative summary that tries to preserve the conversational flow, the goal here is to distill each piece of information into its most dense, factual representation.
Think of it like converting a long, verbose paragraph from a meeting transcript into a single, concise bullet point.
Compression Approach
The process is straightforward:
After each conversational turn (user input + AI response), the agent sends this text to the LLM.
It uses a specific prompt that asks the LLM to act like a “data compression engine”.
The LLM’s task is to re-write the turn as a single, essential statement, stripping out all conversational fluff like greetings, politeness, and filler words.
This highly compressed fact is then stored in a simple list.
The memory of the agent becomes a lean, efficient list of core facts, which can be significantly more token-efficient than even a narrative summary.
class CompressionMemory(BaseMemoryStrategy):
def __init__(self):
"""Initializes the memory with an empty list to store compressed facts."""
self.compressed_facts = []
def add_message(self, user_input: str, ai_response: str):
"""Uses the LLM to compress the latest turn into a concise factual statement."""
text_to_compress = f"User: {user_input}\nAI: {ai_response}"
compression_prompt = (
f"You are a data compression engine. Your task is to distill the following text into its most essential, factual statement. "
f"Be as concise as possible, removing all conversational fluff. For example, 'User asked about my name and I, the AI, responded that my name is an AI assistant' should become 'User asked for AI's name.'\n\n"
f"Text to compress:\n\"{text_to_compress}\""
)
compressed_fact = generate_text("You are an expert data compressor.", compression_prompt)
print(f"--- [Compression Memory: New fact stored: '{compressed_fact}'] ---")
self.compressed_facts.append(compressed_fact)
def get_context(self, query: str) -> str:
"""Returns the list of all compressed facts, formatted as a bulleted list."""
if not self.compressed_facts:
return "No compressed facts in memory."
return "### Compressed Factual Memory:\n- " + "\n- ".join(self.compressed_facts)
__init__(...): Simply creates an empty list, self.compressed_facts.
add_message(...): The core logic. It takes the latest turn, sends it to the LLM with the compression prompt, and stores the concise result.
get_context(...): Formats the list of compressed facts into a clean, bulleted list to be used as context.
Let’s test this strategy with a simple planning conversation.
After each turn, you will see the [Compression Memory: New fact stored] log, showing the very short, compressed version of the interaction. Notice how the final context sent to the LLM is just a terse list of facts, which is highly token-efficient.
compression_memory = CompressionMemory()
agent = AIAgent(memory_strategy=compression_memory)
agent.chat("Okay, I've decided on the venue for the conference. It's going to be the 'Metropolitan Convention Center'.")
agent.chat("The date is confirmed for October 26th, 2025.")
agent.chat("Could you please summarize the key details for the conference plan?")
Once we perform this multi-turn chat conversation, we can take a look at the output. Let’s do that.
############ OUTPUT ############
==== NEW INTERACTION ====
User: Okay, I've decided on the venue for the conference. It's going to be the 'Metropolitan Convention Center'.
--- [Compression Memory: New fact stored: 'The conference venue has been decided as the 'Metropolitan Convention Center'.'] ---
Bot: Great! The Metropolitan Convention Center is an excellent choice. What's next on our planning list?
...
==== NEW INTERACTION ====
User: The date is confirmed for October 26th, 2025.
--- [Compression Memory: New fact stored: 'The conference date is confirmed for October 26th, 2025.'] ---
Bot: Perfect, I've noted the date.
...
==== NEW INTERACTION ====
User: Could you please summarize the key details for the conference plan?
--- Agent Debug Info ---
[Full Prompt Sent to LLM]:
---
SYSTEM: You are a helpful AI assistant.
USER: ### MEMORY CONTEXT
### Compressed Factual Memory:
- The conference venue has been decided as the 'Metropolitan Convention Center'.
- The conference date is confirmed for October 26th, 2025.
...
Bot: Of course. Based on my notes, here are the key details for the conference plan:
- **Venue:** Metropolitan Convention Center
- **Date:** October 26th, 2025
>>>> Tokens: 48 | Response Time: 1.2s
As you can see, this strategy is extremely effective at reducing token count while preserving core facts. It’s a great choice for applications where long-term factual recall is needed on a tight token budget.
However, for conversations that rely heavily on nuance and personality, this aggressive compression might be too much.
OS-Like Memory Management
What if we could build a memory system for our agent that works just like the memory in your computer?
This advanced concept borrows directly from how a computer’s Operating System (OS) manages RAM and a hard disk.
Let’s use an analogy:
RAM (Random Access Memory): This is the super-fast memory your computer uses for active programs. It’s expensive and you don’t have a lot of it. For our agent, the LLM’s context window is its RAM — it’s fast to access but very limited in size.
Hard Disk (or SSD): This is your computer’s long-term storage. It’s much larger and cheaper than RAM, but also slower to access. For our agent, this can be an external database or a simple file where we store old conversation history.
OS Like Memory Management
This memory strategy works by intelligently moving information between these two tiers:
Active Memory (RAM): The most recent conversation turns are kept here, in a small, fast-access buffer.
Passive Memory (Disk): When the active memory is full, the oldest information is moved out to the passive, long-term storage. This is called “paging out.”
Page Fault: When the user asks a question that requires information not currently in the active memory, a “page fault” occurs.
The system must then go to its passive storage, find the relevant information, and load it back into the active context for the LLM to use. This is called “paging in.”
Our simulation will create an active_memory (a deque, like a sliding window) and a passive_memory (a dictionary). When the active memory is full, we'll page out the oldest turn.
To page in, we will use a simple keyword search to simulate a retrieval from passive memory.
from collections import deque  # deque gives an efficient FIFO buffer for the active memory

class OSMemory(BaseMemoryStrategy):
    def __init__(self, ram_size: int = 2):
        """
        Initializes the OS-like memory system.
        Args:
            ram_size: The maximum number of conversational turns to keep in active memory (RAM).
        """
        self.ram_size = ram_size
        self.active_memory = deque()
        self.passive_memory = {}
        self.turn_count = 0
    def add_message(self, user_input: str, ai_response: str):
        """Adds a turn to active memory, paging out the oldest turn to passive memory if RAM is full."""
        turn_id = self.turn_count
        turn_data = f"User: {user_input}\nAI: {ai_response}"
        if len(self.active_memory) >= self.ram_size:
            lru_turn_id, lru_turn_data = self.active_memory.popleft()
            self.passive_memory[lru_turn_id] = lru_turn_data
            print(f"--- [OS Memory: Paging out Turn {lru_turn_id} to passive storage.] ---")
        self.active_memory.append((turn_id, turn_data))
        self.turn_count += 1
    def get_context(self, query: str) -> str:
        """Provides RAM context and simulates a 'page fault' to pull from passive memory if needed."""
        active_context = "\n".join([data for _, data in self.active_memory])
        paged_in_context = ""
        for turn_id, data in self.passive_memory.items():
            # A passive turn is paged in if it shares any word longer than 3 characters with the query.
            if any(word in data.lower() for word in query.lower().split() if len(word) > 3):
                paged_in_context += f"\n(Paged in from Turn {turn_id}): {data}"
                print(f"--- [OS Memory: Page fault! Paging in Turn {turn_id} from passive storage.] ---")
        return f"### Active Memory (RAM):\n{active_context}\n\n### Paged-In from Passive Memory (Disk):\n{paged_in_context}"
    def clear(self):
        """Clears both active and passive memory stores."""
        self.active_memory.clear()
        self.passive_memory = {}
        self.turn_count = 0
        print("OS-like memory cleared.")
__init__(...): Sets up an active_memory deque with a fixed size and an empty passive_memory dictionary.
add_message(...): Adds new turns to active_memory. If active_memory is full, it calls popleft() to get the oldest turn and moves it into the passive_memory dictionary. This is "paging out."
get_context(...): Always includes the active_memory. It then performs a search on passive_memory. If it finds a match for the query, it "pages in" that data by adding it to the context.
Let’s run a scenario where the agent is told a secret code. We’ll then have enough conversation to force that secret code to be “paged out” to passive memory. Finally, we’ll ask for the code and see if the agent can trigger a “page fault” to retrieve it.
You’ll see two key logs:
[Paging out Turn 0] after the third turn
[Page fault! Paging in Turn 0] when we ask the final question
os_memory = OSMemory(ram_size=2)
agent = AIAgent(memory_strategy=os_memory)
agent.chat("The secret launch code is 'Orion-Delta-7'.")
agent.chat("The weather for the launch looks clear.")
agent.chat("The launch window opens at 0400 Zulu.")
agent.chat("I need to confirm the launch code.")
As before, we run this multi-turn conversation with our agent. This is the output we get.
############ OUTPUT ############
...
==== NEW INTERACTION ====
User: The launch window opens at 0400 Zulu.
--- [OS Memory: Paging out Turn 0 to passive storage.] ---
Bot: PROCESSING NEW LAUNCH WINDOW INFORMATION...
...
==== NEW INTERACTION ====
User: I need to confirm the launch code.
--- [OS Memory: Page fault! Paging in Turn 0 from passive storage.] ---
--- Agent Debug Info ---
[Full Prompt Sent to LLM]:
---
SYSTEM: You are a helpful AI assistant.
USER: ### MEMORY CONTEXT
### Active Memory (RAM):
User: The weather for the launch looks clear.
...
User: The launch window opens at 0400 Zulu.
...
### Paged-In from Passive Memory (Disk):
(Paged in from Turn 0): User: The secret launch code is 'Orion-Delta-7'.
...
Bot: CONFIRMING LAUNCH CODE: The stored secret launch code is 'Orion-Delta-7'.
>>>> Tokens: 539 | Response Time: 2.56s
It works perfectly! The agent successfully moved the old, “cold” data to passive storage and then intelligently retrieved it only when the query demanded it.
This is a conceptually powerful model for building large-scale systems with virtually limitless memory while keeping the active context small and fast.
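If the keyword-based page-in feels too brittle, a natural refinement is to reuse the embedding helper from the retrieval section to decide what to page in. The following is a rough sketch, not code from the repo: it assumes the generate_embedding helper and the OSMemory class above, and the SemanticOSMemory name and the 0.5 similarity threshold are arbitrary choices for illustration.
import numpy as np

class SemanticOSMemory(OSMemory):
    """Illustrative variant: pages in passive turns by embedding similarity instead of keyword overlap."""
    def get_context(self, query: str) -> str:
        active_context = "\n".join([data for _, data in self.active_memory])
        paged_in_context = ""
        if self.passive_memory:
            query_vec = np.array(generate_embedding(query), dtype="float32")
            for turn_id, data in self.passive_memory.items():
                turn_vec = np.array(generate_embedding(data), dtype="float32")
                # Cosine similarity between the query and the stored turn.
                sim = float(np.dot(query_vec, turn_vec) / (np.linalg.norm(query_vec) * np.linalg.norm(turn_vec) + 1e-8))
                if sim > 0.5:  # Arbitrary threshold; tune it for your embedding model.
                    paged_in_context += f"\n(Paged in from Turn {turn_id}): {data}"
                    print(f"--- [OS Memory: Page fault! Paging in Turn {turn_id} (similarity {sim:.2f}).] ---")
        return f"### Active Memory (RAM):\n{active_context}\n\n### Paged-In from Passive Memory (Disk):\n{paged_in_context}"
In a real system you would embed each turn once at page-out time (or push it into the FAISS index from the retrieval strategy) rather than re-embedding all of passive memory on every query.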
Choosing the Right Strategy
We have gone through nine distinct memory optimization strategies, from the simple to the highly complex. There is no single “best” strategy; the right choice is a careful balance of your agent’s needs, your budget, and your engineering resources.
So when should you choose which one?
For simple, short-lived bots: Sequential or Sliding Window are perfect. They are easy to implement and get the job done.
For long, creative conversations: Summarization is a great choice to maintain the general flow without a massive token overhead.
For agents needing precise, long-term recall: Retrieval-Based memory is the industry standard. It’s powerful, scalable, and the foundation of most RAG applications.
For highly reliable personal assistants: Memory-Augmented or Hierarchical approaches provide a robust way to separate critical facts from conversational chatter.
For expert systems and knowledge bases: Graph-Based memory is unparalleled in its ability to reason about relationships between data points.
The most powerful agents in production often use hybrid approaches, combining these techniques. You might use a hierarchical system where the long-term memory is a combination of both a vector database and a knowledge graph.
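To make that idea concrete, here is a minimal sketch of one possible hybrid, assuming the SlidingWindowMemory, RetrievalMemory, and GraphMemory classes built earlier in this guide; the HybridMemory name and the choice to feed every turn to all three layers are illustrative assumptions, not a prescription.
class HybridMemory(BaseMemoryStrategy):
    """Illustrative hybrid: a working window plus a vector store plus a knowledge graph."""
    def __init__(self, window_size: int = 2, k: int = 2, embedding_dim: int = 3584):
        self.working = SlidingWindowMemory(window_size=window_size)
        self.vector_store = RetrievalMemory(k=k, embedding_dim=embedding_dim)
        self.graph = GraphMemory()
    def add_message(self, user_input: str, ai_response: str):
        # Every turn goes to all three layers; a production agent would route more selectively
        # (note that GraphMemory adds an extra LLM call per turn for triple extraction).
        self.working.add_message(user_input, ai_response)
        self.vector_store.add_message(user_input, ai_response)
        self.graph.add_message(user_input, ai_response)
    def get_context(self, query: str) -> str:
        # Structured facts first, then semantically similar turns, then the recent conversation.
        return (
            f"{self.graph.get_context(query)}\n\n"
            f"{self.vector_store.get_context(query)}\n\n"
            f"### Recent Conversation:\n{self.working.get_context(query)}"
        )
    def clear(self):
        # Assumes each sub-strategy implements clear(), as BaseMemoryStrategy requires.
        self.working.clear()
        self.vector_store.clear()
        self.graph.clear()
An agent built with AIAgent(memory_strategy=HybridMemory()) would then work exactly like the single-strategy agents above; the trade-off is cost, since each turn now triggers both an embedding call and a triple-extraction call.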
The key is to start with a clear understanding of what you need your agent to remember, for how long, and with what level of precision. By mastering these memory strategies, you can move beyond building simple chatbots and start creating truly intelligent agents that learn, remember, and perform better over time.
Source
agent = AIAgent(memory_strategy=compression_memory)
agent.chat("Okay, I've decided on the venue for the conference. It's going to be the 'Metropolitan Convention Center'.")
agent.chat("The date is confirmed for October 26th, 2025.")
agent.chat("Could you please summarize the key details for the conference plan?")
Once we run this multi-turn conversation, we can take a look at the output.
############ OUTPUT ############
==== NEW INTERACTION ====
User: Okay, I've decided on the venue for the conference. It's going to be the 'Metropolitan Convention Center'.
--- [Compression Memory: New fact stored: 'The conference venue has been decided as the 'Metropolitan Convention Center'.'] ---
Bot: Great! The Metropolitan Convention Center is an excellent choice. What's next on our planning list?
...
==== NEW INTERACTION ====
User: The date is confirmed for October 26th, 2025.
--- [Compression Memory: New fact stored: 'The conference date is confirmed for October 26th, 2025.'] ---
Bot: Perfect, I've noted the date.
...
==== NEW INTERACTION ====
User: Could you please summarize the key details for the conference plan?
--- Agent Debug Info ---
[Full Prompt Sent to LLM]:
---
SYSTEM: You are a helpful AI assistant.
USER: ### MEMORY CONTEXT
### Compressed Factual Memory:
- The conference venue has been decided as the 'Metropolitan Convention Center'.
- The conference date is confirmed for October 26th, 2025.
...
Bot: Of course. Based on my notes, here are the key details for the conference plan:
- **Venue:** Metropolitan Convention Center
- **Date:** October 26th, 2025
>>>> Tokens: 48 | Response Time: 1.2s
As you can see, this strategy is extremely effective at reducing token count while preserving core facts. It’s a great choice for applications where long-term factual recall is needed on a tight token budget.
However, for conversations that rely heavily on nuance and personality, this aggressive compression might be too much.
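The “consolidation” half of this strategy’s name is also worth acting on: even compressed facts pile up over hundreds of turns. The class above doesn’t handle this, but here is a minimal sketch of one way to add it, reusing the generate_text helper; the max_facts threshold, the function name consolidate_facts, and the prompt wording are illustrative assumptions.

def consolidate_facts(facts: list, max_facts: int = 10) -> list:
    """When the fact list grows past `max_facts`, ask the LLM to merge
    redundant or overlapping facts into a shorter, deduplicated list."""
    if len(facts) <= max_facts:
        return facts
    consolidation_prompt = (
        "Merge the following facts into the smallest possible list of distinct, "
        "non-overlapping factual statements. Return one fact per line.\n\n- "
        + "\n- ".join(facts)
    )
    merged = generate_text("You are an expert data compressor.", consolidation_prompt)
    return [line.strip("- ").strip() for line in merged.split("\n") if line.strip()]

Calling something like this at the end of add_message keeps the memory bounded even across very long sessions, at the cost of one extra LLM call whenever the threshold is crossed.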
OS-Like Memory Management
What if we could build a memory system for our agent that works just like the memory in your computer?
This advanced concept borrows directly from how a computer’s Operating System (OS) manages RAM and a hard disk.
Let’s use an analogy:
RAM (Random Access Memory): This is the super-fast memory your computer uses for active programs. It’s expensive and you don’t have a lot of it. For our agent, the LLM’s context window is its RAM — it’s fast to access but very limited in size.
Hard Disk (or SSD): This is your computer’s long-term storage. It’s much larger and cheaper than RAM, but also slower to access. For our agent, this can be an external database or a simple file where we store old conversation history.

This memory strategy works by intelligently moving information between these two tiers:
Active Memory (RAM): The most recent conversation turns are kept here, in a small, fast-access buffer.
Passive Memory (Disk): When the active memory is full, the oldest information is moved out to the passive, long-term storage. This is called “paging out.”
Page Fault: When the user asks a question that requires information not currently in the active memory, a “page fault” occurs.
The system must then go to its passive storage, find the relevant information, and load it back into the active context for the LLM to use. This is called “paging in.”
Our simulation will create an active_memory (a deque, like a sliding window) and a passive_memory (a dictionary). When the active memory is full, we'll page out the oldest turn.
To page in, we will use a simple keyword search to simulate a retrieval from passive memory.
from collections import deque

class OSMemory(BaseMemoryStrategy):
    def __init__(self, ram_size: int = 2):
        """
        Initializes the OS-like memory system.

        Args:
            ram_size: The maximum number of conversational turns to keep in active memory (RAM).
        """
        self.ram_size = ram_size
        self.active_memory = deque()
        self.passive_memory = {}
        self.turn_count = 0

    def add_message(self, user_input: str, ai_response: str):
        """Adds a turn to active memory, paging out the oldest turn to passive memory if RAM is full."""
        turn_id = self.turn_count
        turn_data = f"User: {user_input}\nAI: {ai_response}"
        if len(self.active_memory) >= self.ram_size:
            lru_turn_id, lru_turn_data = self.active_memory.popleft()
            self.passive_memory[lru_turn_id] = lru_turn_data
            print(f"--- [OS Memory: Paging out Turn {lru_turn_id} to passive storage.] ---")
        self.active_memory.append((turn_id, turn_data))
        self.turn_count += 1

    def get_context(self, query: str) -> str:
        """Provides RAM context and simulates a 'page fault' to pull from passive memory if needed."""
        active_context = "\n".join([data for _, data in self.active_memory])
        paged_in_context = ""
        for turn_id, data in self.passive_memory.items():
            if any(word in data.lower() for word in query.lower().split() if len(word) > 3):
                paged_in_context += f"\n(Paged in from Turn {turn_id}): {data}"
                print(f"--- [OS Memory: Page fault! Paging in Turn {turn_id} from passive storage.] ---")
        return f"### Active Memory (RAM):\n{active_context}\n\n### Paged-In from Passive Memory (Disk):\n{paged_in_context}"

    def clear(self):
        """Clears both active and passive memory stores."""
        self.active_memory.clear()
        self.passive_memory = {}
        self.turn_count = 0
        print("OS-like memory cleared.")
__init__(...): Sets up an active_memory deque with a fixed size and an empty passive_memory dictionary.
add_message(...): Adds new turns to active_memory. If active_memory is full, it calls popleft() to get the oldest turn and moves it into the passive_memory dictionary. This is "paging out."
get_context(...): Always includes the active_memory. It then performs a search on passive_memory. If it finds a match for the query, it "pages in" that data by adding it to the context.
Let’s run a scenario where the agent is told a secret code. We’ll then have enough conversation to force that secret code to be “paged out” to passive memory. Finally, we’ll ask for the code and see if the agent can trigger a “page fault” to retrieve it.
You’ll see two key logs:
[Paging out Turn 0] after the third turn
[Page fault! Paging in Turn 0] when we ask the final question
os_memory = OSMemory(ram_size=2)
agent = AIAgent(memory_strategy=os_memory)
agent.chat("The secret launch code is 'Orion-Delta-7'.")
agent.chat("The weather for the launch looks clear.")
agent.chat("The launch window opens at 0400 Zulu.")
agent.chat("I need to confirm the launch code.")
As before, we run this multi-turn conversation with our AI agent and get the following output.
############ OUTPUT ############
...
==== NEW INTERACTION ====
User: The launch window opens at 0400 Zulu.
--- [OS Memory: Paging out Turn 0 to passive storage.] ---
Bot: PROCESSING NEW LAUNCH WINDOW INFORMATION...
...
==== NEW INTERACTION ====
User: I need to confirm the launch code.
--- [OS Memory: Page fault! Paging in Turn 0 from passive storage.] ---
--- Agent Debug Info ---
[Full Prompt Sent to LLM]:
---
SYSTEM: You are a helpful AI assistant.
USER: ### MEMORY CONTEXT
### Active Memory (RAM):
User: The weather for the launch looks clear.
...
User: The launch window opens at 0400 Zulu.
...
### Paged-In from Passive Memory (Disk):
(Paged in from Turn 0): User: The secret launch code is 'Orion-Delta-7'.
...
Bot: CONFIRMING LAUNCH CODE: The stored secret launch code is 'Orion-Delta-7'.
>>>> Tokens: 539 | Response Time: 2.56s
It works perfectly! The agent successfully moved the old, “cold” data to passive storage and then intelligently retrieved it only when the query demanded it.
This is a conceptually powerful model for building large-scale systems with virtually limitless memory while keeping the active context small and fast.
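In a production version, you would likely swap the keyword match in get_context for semantic retrieval over passive storage (or a real vector store such as FAISS). As a hedged sketch, assuming the generate_embeddings helper accepts a list of strings and returns one vector per string (your actual signature may differ), the page-in step could look roughly like this; semantic_page_in is an illustrative name, not part of the class above.

import numpy as np

def semantic_page_in(passive_memory: dict, query: str, top_k: int = 1, threshold: float = 0.5) -> list:
    """Rank passive-memory turns by cosine similarity to the query and page in the best matches."""
    if not passive_memory:
        return []
    turn_ids = list(passive_memory.keys())
    texts = [passive_memory[tid] for tid in turn_ids]
    # Assumption: generate_embeddings(list_of_texts) returns one vector per input text.
    vectors = np.array(generate_embeddings(texts + [query]), dtype=float)
    turn_vecs, query_vec = vectors[:-1], vectors[-1]
    sims = (turn_vecs @ query_vec) / (
        np.linalg.norm(turn_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9
    )
    best = np.argsort(sims)[::-1][:top_k]
    return [
        f"(Paged in from Turn {turn_ids[i]}): {texts[i]}"
        for i in best if sims[i] >= threshold
    ]

This would catch the launch-code question even if the user phrased it without any word that literally appears in the stored turn.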
Choosing the Right Strategy
We have gone through nine distinct memory optimization strategies, from the simple to the highly complex. There is no single “best” strategy; the right choice is a careful balance of your agent’s needs, your budget, and your engineering resources.
So when should you choose which?
For simple, short-lived bots: Sequential or Sliding Window are perfect. They are easy to implement and get the job done.
For long, creative conversations: Summarization is a great choice to maintain the general flow without a massive token overhead.
For agents needing precise, long-term recall: Retrieval-Based memory is the industry standard. It’s powerful, scalable, and the foundation of most RAG applications.
For highly reliable personal assistants: Memory-Augmented or Hierarchical approaches provide a robust way to separate critical facts from conversational chatter.
For expert systems and knowledge bases: Graph-Based memory is unparalleled in its ability to reason about relationships between data points.
The most powerful agents in production often use hybrid approaches, combining these techniques. You might use a hierarchical system where the long-term memory is a combination of both a vector database and a knowledge graph.
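As a rough sketch of that idea (not a prescription), the strategies in this post can be composed because they all share the BaseMemoryStrategy interface; a hypothetical HybridMemory could simply fan each turn out to several of them and concatenate their contexts.

class HybridMemory(BaseMemoryStrategy):
    """Illustrative composite: forwards every turn to several strategies
    and joins their contexts at query time."""
    def __init__(self, strategies: list):
        self.strategies = strategies

    def add_message(self, user_input: str, ai_response: str):
        for strategy in self.strategies:
            strategy.add_message(user_input, ai_response)

    def get_context(self, query: str) -> str:
        return "\n\n".join(s.get_context(query) for s in self.strategies)

# Hypothetical pairing: compressed facts for cheap recall plus a graph for relationships.
agent = AIAgent(memory_strategy=HybridMemory([CompressionMemory(), GraphMemory()]))

Depending on how BaseMemoryStrategy is defined, you may also need to forward clear() to each component.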
The key is to start with a clear understanding of what you need your agent to remember, for how long, and with what level of precision. By mastering these memory strategies, you can move beyond building simple chatbots and start creating truly intelligent agents that learn, remember, and perform better over time.
Source