Context Window Management

Every time you call an LLM, you pay for every token in the request, including the full conversation history. As conversations grow, so do your cost and latency. A 50-turn conversation can easily exceed 10,000 input tokens per request.

Engram solves this by extracting discrete facts from conversations and storing them as searchable memories. Instead of sending the entire history, you search for relevant memories and send only those — keeping context size flat regardless of conversation length.

This tutorial builds on the Memory Chat App pattern and shows you how to:

  • Measure the token cost of sending full conversation history
  • Replace history with memory search for constant-size context
  • Compare the two approaches side-by-side

Prerequisites

  • An Engram project with an API key (Quickstart)
  • An Anthropic or OpenAI API key
  • Python packages: pip install weaviate-engram anthropic openai

Step 1: The naive approach

The most common pattern is to append every message to a list and send the full list with each API call. This works for short conversations but becomes expensive fast.

def naive_chat_anthropic():
    """Naive approach: send full conversation history every time."""
    import anthropic

    anthropic_client = anthropic.Anthropic()
    messages = []

    while True:
        user_input = input("You: ")
        if user_input.lower() == "quit":
            break

        messages.append({"role": "user", "content": user_input})

        # Every call sends the ENTIRE conversation history
        response = anthropic_client.messages.create(
            model="claude-sonnet-4-5-20250929",
            max_tokens=1024,
            system="You are a helpful assistant.",
            messages=messages,  # This list grows with every turn
        )
        assistant_message = response.content[0].text
        messages.append({"role": "assistant", "content": assistant_message})
        print(f"Assistant: {assistant_message}")
        print(f"Messages in context: {len(messages)}\n")

The messages list grows by two entries every turn (user + assistant). By turn 50, you're sending about 100 messages in every request. Per-request token usage grows linearly with the number of turns, and because turn 1's messages are re-sent at turn 2, 3, 4, and every subsequent turn, the total you pay over the whole conversation grows roughly quadratically.
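To make the growth concrete, the sketch below tallies input tokens for the naive approach under one simplifying assumption: every exchange (user message plus assistant reply) adds a fixed number of tokens. The 125-token figure and the function names are illustrative choices, not measurements from any API.

```python
def naive_request_tokens(turn: int, tokens_per_exchange: int = 125) -> int:
    """Input tokens sent on a given turn when the full history is re-sent.

    Assumes each exchange contributes a fixed, hypothetical token count.
    """
    return turn * tokens_per_exchange


def naive_cumulative_tokens(turns: int, tokens_per_exchange: int = 125) -> int:
    """Total input tokens billed across every request from turn 1 to `turns`."""
    return sum(
        naive_request_tokens(t, tokens_per_exchange) for t in range(1, turns + 1)
    )


if __name__ == "__main__":
    for turns in (1, 10, 50):
        per_request = naive_request_tokens(turns)
        total = naive_cumulative_tokens(turns)
        print(f"turn {turns}: {per_request} tokens this request, {total} in total")
```

Under these assumptions the request at turn 50 carries 6,250 input tokens, while the conversation as a whole has been billed for 159,375.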

Step 2: Store conversations as memories

Instead of keeping messages in a growing list, send them to Engram after each exchange. Engram extracts discrete facts and stores them as searchable memories.

# `client` is an EngramClient instance and `user_id` identifies the end user
# (for example "user-123"); both must be defined before this snippet runs.
conversation = [
    {"role": "user", "content": "I'm a software engineer working on a Python web app."},
    {
        "role": "assistant",
        "content": "That sounds interesting! What framework are you using?",
    },
    {
        "role": "user",
        "content": "I'm using FastAPI with PostgreSQL. I prefer async patterns.",
    },
    {
        "role": "assistant",
        "content": "Great choices! FastAPI's async support works well with PostgreSQL.",
    },
    {
        "role": "user",
        "content": "I also use Redis for caching and Celery for background tasks.",
    },
    {
        "role": "assistant",
        "content": "That's a solid stack. Redis and Celery pair nicely with FastAPI.",
    },
]

# Submit the conversation; Engram extracts memories asynchronously
run = client.memories.add(
    conversation,
    user_id=user_id,
    group="default",
)

# Block until the extraction run finishes
status = client.runs.wait(run.run_id)
print(f"Run status: {status.status}")
print(f"Memories created: {len(status.memories_created)}")

From a 6-message conversation, Engram might extract memories like:

  • "The user is a software engineer"
  • "The user works primarily in Python"
  • "The user uses FastAPI with PostgreSQL"
  • "The user prefers async patterns"
  • "The user uses Redis for caching and Celery for background tasks"

Each fact is stored once and retrieved only when relevant.

Step 3: Replace history with memory search

Instead of sending the full conversation history, search Engram for relevant memories and keep only the last 2-3 exchanges for conversational continuity.

def memory_augmented_chat_anthropic():
    """Memory-augmented approach: use Engram instead of full history."""
    import os

    import anthropic

    # EngramClient and HybridRetrieval come from the weaviate-engram package
    # and must be imported as well (import path omitted here).
    engram = EngramClient(
        api_key=os.environ["ENGRAM_API_KEY"],
    )
    anthropic_client = anthropic.Anthropic()
    user_id = "user-123"
    recent_messages = []  # Only the tail of this list is sent to the LLM

    while True:
        user_input = input("You: ")
        if user_input.lower() == "quit":
            break

        # Search Engram for relevant memories
        results = engram.memories.search(
            query=user_input,
            user_id=user_id,
            group="default",
            retrieval_config=HybridRetrieval(limit=5),
        )
        memory_context = "\n".join(f"- {m.content}" for m in results)

        system_prompt = f"""You are a helpful assistant.

Relevant context from previous conversations:
{memory_context}"""

        recent_messages.append({"role": "user", "content": user_input})

        # Send only recent messages + memory context (not full history).
        # An odd-sized window (2 previous exchanges + the current user message)
        # keeps the slice starting on a user message rather than an assistant one.
        context_messages = recent_messages[-5:]
        response = anthropic_client.messages.create(
            model="claude-sonnet-4-5-20250929",
            max_tokens=1024,
            system=system_prompt,
            messages=context_messages,
        )
        assistant_message = response.content[0].text
        recent_messages.append({"role": "assistant", "content": assistant_message})
        print(f"Assistant: {assistant_message}")
        print(f"Messages in context: {len(context_messages)}\n")

        # Store the exchange as a memory
        run = engram.memories.add(
            [recent_messages[-2], recent_messages[-1]],
            user_id=user_id,
            group="default",
        )
        engram.runs.wait(run.run_id)

    engram.close()

The context window now contains:

  • System prompt: ~5 retrieved memories (~50 tokens)
  • Recent messages: Last 2-3 exchanges (~4-6 messages, ~500-750 tokens)
  • Total: up to ~800 tokens, flat regardless of conversation length

Step 4: Compare side-by-side

Here's a comparison of token usage as conversation length grows:

Turn   Naive (tokens)   Memory (tokens)   Savings
--------------------------------------------------
1      125              175               -40%
5      625              425               32%
10     1,250            425               66%
20     2,500            425               83%
50     6,250            425               93%

At turn 1, the memory approach has slight overhead from the retrieved memories added to the system prompt. By turn 10, it saves 66%. By turn 50, it saves 93% of input tokens.
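The figures in the table are consistent with a simple cost model, sketched below. The model is an assumption inferred from the numbers rather than something Engram reports: each exchange costs 125 tokens, the retrieved memories add 50 tokens, and the memory approach keeps at most 3 exchanges in context. All names and constants here are illustrative.

```python
TOKENS_PER_EXCHANGE = 125  # assumed average size of one user + assistant exchange
MEMORY_OVERHEAD = 50       # assumed size of the retrieved memories in the prompt
WINDOW_EXCHANGES = 3       # assumed number of recent exchanges kept in context


def naive_tokens(turn: int) -> int:
    """Input tokens for one request when the full history is sent."""
    return turn * TOKENS_PER_EXCHANGE


def memory_tokens(turn: int) -> int:
    """Input tokens for one request with memories plus a bounded window."""
    return MEMORY_OVERHEAD + min(turn, WINDOW_EXCHANGES) * TOKENS_PER_EXCHANGE


def savings_percent(turn: int) -> int:
    """Savings relative to the naive approach, truncated to a whole percent."""
    naive = naive_tokens(turn)
    return int(100 * (naive - memory_tokens(turn)) / naive)


if __name__ == "__main__":
    for turn in (1, 5, 10, 20, 50):
        print(turn, naive_tokens(turn), memory_tokens(turn), f"{savings_percent(turn)}%")
```

Real token counts depend on message length and the tokenizer, so treat these values as a way to reason about the shape of the curve rather than as a billing estimate.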

Step 5: Advanced patterns

Topic filtering

If your project has multiple topics, filter search results to specific topics for more precise retrieval:

results = client.memories.search(
    query="What tech stack does the user prefer?",
    topics=["UserKnowledge"],
    user_id=user_id,
    group="default",
    retrieval_config=HybridRetrieval(limit=5),
)

for memory in results:
    print(f"- {memory.content} (topic: {memory.topic})")

Use a ConversationSummary topic for full history

For long conversations where you want the LLM to see every detail (not just discrete facts), let Engram maintain a single running summary of the entire conversation and replace the message history with that summary on each turn.

Enable the optional Include Conversation Summary Topic checkbox when creating your project from the Personalization template. This adds a ConversationSummary topic that's scoped by conversation_id and bounded to one memory per conversation. Each memories.add updates that memory in place.

Fetch it with the fetch retrieval type — it returns the bounded memory directly by topic and scope, without scoring by query relevance.

import uuid

# `client`, `user_id`, FetchRetrieval, and APIError are assumed to be set up
# or imported from the weaviate-engram package beforehand.
conversation_id = f"session-{uuid.uuid4().hex[:8]}"

# Add messages tied to a conversation_id. If the project enabled the
# `ConversationSummary` topic, the pipeline maintains one summary memory
# per conversation_id and rewrites it in place on each add.
run = client.memories.add(
    [
        {"role": "user", "content": "I'm planning a trip to Lisbon next month."},
        {"role": "assistant", "content": "Great choice! Any specific neighborhoods?"},
        {"role": "user", "content": "I'd love to stay in Alfama for the historic vibe."},
    ],
    user_id=user_id,
    group="default",
    properties={"conversation_id": conversation_id},
)
client.runs.wait(run.run_id)

# Fetch the running summary. `fetch` returns the bounded memory directly
# by topic + scope, without ranking by query relevance. The topic must be
# enabled in the project; otherwise the search returns a "topic not found"
# error.
try:
    summary_results = client.memories.search(
        query="conversation summary",  # ignored by fetch retrieval
        user_id=user_id,
        group="default",
        topics=["ConversationSummary"],
        properties={"conversation_id": conversation_id},
        retrieval_config=FetchRetrieval(limit=1),
    )
except APIError:
    summary_results = []  # ConversationSummary topic not enabled in this project

if summary_results:
    print(f"Summary: {summary_results[0].content}")

The summary stays one memory regardless of how long the conversation runs, so the token cost of including it in your LLM call is constant — and the LLM still sees the full conversational context.

Hybrid search tuning

Adjust the retrieval_config to control how memories are ranked:

# Pure semantic search — best for conceptual similarity
VectorRetrieval(limit=5)

# Keyword search — best for exact terms
BM25Retrieval(limit=5)

# Hybrid (recommended) — combines both approaches
HybridRetrieval(limit=10)

Dual-memory pattern

For the best balance of continuity and context, combine both approaches:

  1. Recent messages (last 2-3 exchanges) — Maintains conversational flow
  2. Engram memory search — Provides relevant historical context

This is the pattern used in the memory-augmented chat loop above. The recent messages handle references like "that" and "it", while Engram provides the long-term context that makes the assistant feel like it truly remembers.
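A minimal sketch of how the two sources can be merged into one request. The helper name `build_context` and its parameters are hypothetical. It takes memory strings that have already been retrieved (for example, the `content` of each search result) and does not call Engram or any LLM API itself.

```python
def build_context(memories, recent_messages, max_messages=5,
                  base_prompt="You are a helpful assistant."):
    """Return (system_prompt, messages) for a memory-augmented LLM call.

    memories: list of already-retrieved memory strings (long-term context).
    recent_messages: full list of {"role", "content"} dicts, oldest first.
    max_messages: size of the short-term window sent to the LLM.
    """
    window = recent_messages[-max_messages:]
    # Drop leading assistant messages so the window starts with a user turn.
    while window and window[0]["role"] != "user":
        window = window[1:]

    if memories:
        memory_block = "\n".join(f"- {m}" for m in memories)
        system_prompt = (
            f"{base_prompt}\n\n"
            f"Relevant context from previous conversations:\n{memory_block}"
        )
    else:
        system_prompt = base_prompt
    return system_prompt, window


if __name__ == "__main__":
    history = []
    for i in range(4):
        history.append({"role": "user", "content": f"question {i}"})
        history.append({"role": "assistant", "content": f"answer {i}"})
    history.append({"role": "user", "content": "What about it?"})

    prompt, messages = build_context(["The user prefers async patterns"], history)
    print(prompt)
    print(len(messages), messages[0]["role"])
```

Keeping the trimming logic in one function makes the window size a single tunable parameter, and the returned pair can be passed as the `system` and `messages` arguments of the chat call.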

Next steps

  • Memory Chat App — The foundational tutorial for integrating Engram with a chat app.
  • Personalized RAG — Add a knowledge base alongside per-user memory.
  • Store memories — Learn about all three content types (string, conversation, pre-extracted).
