Memory Systems for LLM Applications
Why Memory? LLMs are stateless by default. Memory systems maintain context across conversation turns and sessions, enabling coherent multi-turn interactions.
Memory Types
Short-Term (Conversation Buffer) Store recent messages in full:
class ConversationMemory:
def __init__(self):
self.messages = []
def add(self, role: str, content: str):
self.messages.append({"role": role, "content": content})
def get_messages(self) -> list:
return self.messages
Window Memory Keep only last N turns:
class WindowMemory:
def __init__(self, window_size: int = 10):
self.messages = []
self.window_size = window_size
def add(self, role: str, content: str):
self.messages.append({"role": role, "content": content})
if len(self.messages) > self.window_size:
self.messages = self.messages[-self.window_size:]
Summary Memory Periodically summarize older messages:
class SummaryMemory:
def __init__(self, llm):
self.llm = llm
self.summary = ""
self.recent_messages = []
def compress(self):
if len(self.recent_messages) > 10:
self.summary = self.llm.generate(
f"Summarize: {self.recent_messages[:5]}"
)
self.recent_messages = self.recent_messages[5:]
Entity Memory Track entities mentioned in conversation:
entities = {
"John": {"role": "customer", "mentioned": ["order #123"]},
"Project Alpha": {"status": "in progress", "deadline": "Q2"}
}
Long-Term Memory
Vector Storage Store and retrieve past interactions by similarity:
# Store interaction embedding
embedding = embed(conversation_summary)
vector_store.add(embedding, metadata={"session_id": ...})
# Retrieve relevant history
relevant = vector_store.query(embed(current_query), top_k=5)
Key-Value Store Store structured information:
- User preferences
- Past decisions
- Learned facts
Memory in Practice
| Memory Type | Use Case | Tradeoff |
|---|---|---|
| Full buffer | Short convos | Token limit |
| Window | Long convos | Loses early context |
| Summary | Very long convos | Compression loss |
| Vector | Cross-session | Retrieval latency |
| Entity | Fact tracking | Maintenance overhead |
Best Practices
- Combine memory types for different needs
- Compress aggressively for long contexts
- Consider privacy (what to remember/forget)
- Persist across restarts for production apps
memoryconversation historycontext
Explore 500+ Semiconductor & AI Topics
From EUV lithography to CUDA optimization — search the full knowledge base or chat with our AI assistant.