Context Window Management

Every time you call an LLM, you pay for every token in the request — including the full conversation history. As conversations grow, so does your cost and latency. A 50-turn conversation can easily exceed 10,000 input tokens per request.

Engram solves this by extracting discrete facts from conversations and storing them as searchable memories. Instead of sending the entire history, you search for relevant memories and send only those — keeping context size flat regardless of conversation length.

This tutorial builds on the Memory Chat App pattern and shows you how to:

Measure the token cost of sending full conversation history
Replace history with memory search for constant-size context
Compare the two approaches side-by-side

Prerequisites

An Engram project with an API key (Quickstart)
An Anthropic or OpenAI API key
Python packages: pip install weaviate-engram anthropic openai

Step 1: The naive approach

The most common pattern is to append every message to a list and send the full list with each API call. This works for short conversations but becomes expensive fast.

def naive_chat_anthropic():
    """Naive approach: send full conversation history every time."""
    import anthropic

    anthropic_client = anthropic.Anthropic()
    messages = []

    while True:
        user_input = input("You: ")
        if user_input.lower() == "quit":
            break

        messages.append({"role": "user", "content": user_input})

        # Every call sends the ENTIRE conversation history
        response = anthropic_client.messages.create(
            model="claude-sonnet-4-5-20250929",
            max_tokens=1024,
            system="You are a helpful assistant.",
            messages=messages,  # This list grows with every turn
        )
        assistant_message = response.content[0].text
        messages.append({"role": "assistant", "content": assistant_message})
        print(f"Assistant: {assistant_message}")
        print(f"Messages in context: {len(messages)}\n")

The messages list grows by two entries every turn (user + assistant), so by turn 50 you're sending about 100 messages in every request. Per-request token usage grows linearly with the number of turns. Because turn 1's messages are re-sent at turn 2, 3, 4, and every subsequent turn, you pay for the same messages over and over, and the cumulative cost across the whole conversation grows roughly quadratically.
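To see how quickly this adds up, the growth can be simulated without calling any API. The sketch below is an illustration only: it assumes every exchange (user message plus assistant reply) costs a fixed, hypothetical 125 tokens, and the function names are ours, not part of any SDK.

```python
def naive_request_tokens(turn: int, tokens_per_exchange: int = 125) -> int:
    """Input tokens for the request at a 1-indexed turn when all history is resent.

    Simplification (assumption): the request at turn N is treated as carrying
    N exchanges' worth of tokens.
    """
    return turn * tokens_per_exchange


def naive_cumulative_tokens(turns: int, tokens_per_exchange: int = 125) -> int:
    """Total input tokens sent across the first `turns` requests."""
    return sum(
        naive_request_tokens(t, tokens_per_exchange) for t in range(1, turns + 1)
    )


if __name__ == "__main__":
    for turns in (1, 10, 50):
        print(
            f"turn {turns:>2}: request={naive_request_tokens(turns):>5} tokens, "
            f"cumulative={naive_cumulative_tokens(turns):>7} tokens"
        )
```

Under these assumptions the request at turn 50 is 50 times larger than the request at turn 1, while the cumulative total is 1,275 times larger. For real measurements, the Anthropic SDK reports the input size of each call as response.usage.input_tokens.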

Step 2: Store conversations as memories

Instead of keeping messages in a growing list, send them to Engram after each exchange. Engram extracts discrete facts and stores them as searchable memories.

conversation = [
    {"role": "user", "content": "I'm a software engineer working on a Python web app."},
    {
        "role": "assistant",
        "content": "That sounds interesting! What framework are you using?",
    },
    {
        "role": "user",
        "content": "I'm using FastAPI with PostgreSQL. I prefer async patterns.",
    },
    {
        "role": "assistant",
        "content": "Great choices! FastAPI's async support works well with PostgreSQL.",
    },
    {
        "role": "user",
        "content": "I also use Redis for caching and Celery for background tasks.",
    },
    {
        "role": "assistant",
        "content": "That's a solid stack. Redis and Celery pair nicely with FastAPI.",
    },
]

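# `client` is an EngramClient (constructed as in Step 3) and `user_id` is a
# string identifying the end user, such as "user-123".
run = client.memories.add(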
    conversation,
    user_id=user_id,
    group="default",
)

status = client.runs.wait(run.run_id)
print(f"Run status: {status.status}")
print(f"Memories created: {len(status.memories_created)}")

From a 6-message conversation, Engram might extract memories like:

"The user is a software engineer"
"The user works primarily in Python"
"The user uses FastAPI with PostgreSQL"
"The user prefers async patterns"
"The user uses Redis for caching and Celery for background tasks"

Each fact is stored once and retrieved only when relevant.

Step 3: Replace history with memory search

Instead of sending the full conversation history, search Engram for relevant memories and keep only the last 2-3 exchanges for conversational continuity.

def memory_augmented_chat_anthropic():
    """Memory-augmented approach: use Engram instead of full history."""
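    import os

    import anthropic

    # EngramClient and HybridRetrieval are provided by the weaviate-engram
    # package installed in Prerequisites.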

    engram = EngramClient(
        api_key=os.environ["ENGRAM_API_KEY"],
    )
    anthropic_client = anthropic.Anthropic()
    user_id = "user-123"
    recent_messages = []  # Keep only last few exchanges

    while True:
        user_input = input("You: ")
        if user_input.lower() == "quit":
            break

        # Search Engram for relevant memories
        results = engram.memories.search(
            query=user_input,
            user_id=user_id,
            group="default",
            retrieval_config=HybridRetrieval(limit=5),
        )
        memory_context = "\n".join(f"- {m.content}" for m in results)

        system_prompt = f"""You are a helpful assistant.

Relevant context from previous conversations:
{memory_context}"""

        recent_messages.append({"role": "user", "content": user_input})

        # Send only recent messages + memory context (not full history).
        # An odd-sized window (2 prior exchanges + the new user message) keeps
        # the slice starting on a user turn, so roles alternate cleanly.
        context_messages = recent_messages[-5:]
        response = anthropic_client.messages.create(
            model="claude-sonnet-4-5-20250929",
            max_tokens=1024,
            system=system_prompt,
            messages=context_messages,
        )
        assistant_message = response.content[0].text
        recent_messages.append({"role": "assistant", "content": assistant_message})
        print(f"Assistant: {assistant_message}")
        # Report the number of messages that were actually sent
        print(f"Messages in context: {len(context_messages)}\n")

        # Store the exchange as a memory
        run = engram.memories.add(
            [recent_messages[-2], recent_messages[-1]],
            user_id=user_id,
            group="default",
        )
        engram.runs.wait(run.run_id)

    engram.close()

The context window now contains:

System prompt: ~5 retrieved memories (~50 tokens)
Recent messages: Last 2-3 exchanges (~4-6 messages, ~500-750 tokens)
Total: ~800 tokens — flat, regardless of conversation length
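The two pieces of that context can be assembled with a small helper. The following is a minimal sketch, not part of the Engram or Anthropic SDKs: build_context and its parameters are hypothetical names, memories are passed in as plain strings (for example the content of each search result), and the default window of 5 messages is an assumption chosen so that the slice begins on a user turn.

```python
from typing import Dict, List, Tuple

Message = Dict[str, str]


def build_context(
    memories: List[str],
    history: List[Message],
    window: int = 5,
) -> Tuple[str, List[Message]]:
    """Return (system_prompt, messages) for one LLM call.

    `history` is expected to end with the new user message. Only the last
    `window` messages are kept, and any leading non-user messages are dropped
    so the slice always starts on a user turn.
    """
    system_prompt = "You are a helpful assistant."
    if memories:
        # Only add the memory section when the search returned something
        bullet_list = "\n".join(f"- {m}" for m in memories)
        system_prompt += (
            "\n\nRelevant context from previous conversations:\n" + bullet_list
        )

    recent = history[-window:]
    while recent and recent[0]["role"] != "user":
        recent = recent[1:]
    return system_prompt, recent


if __name__ == "__main__":
    history: List[Message] = []
    for i in range(1, 5):
        history.append({"role": "user", "content": f"question {i}"})
        if i < 4:
            history.append({"role": "assistant", "content": f"answer {i}"})

    prompt, messages = build_context(["The user prefers async patterns"], history)
    print(prompt)
    print([m["content"] for m in messages])
```

Because the window has a fixed upper bound and the number of retrieved memories is capped by the search limit, the size of the returned context does not depend on how long the conversation has been running.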

Step 4: Compare side-by-side

Here's a comparison of token usage as conversation length grows:

Turn   Naive (tokens)     Memory (tokens)    Savings
----------------------------------------------------------
1      125                175                -40%
5      625                425                32%
10     1,250              425                66%
20     2,500              425                83%
50     6,250              425                93%

At turn 1, the memory approach uses more input tokens (175 vs. 125) because the retrieved memories are added to an otherwise tiny request. By turn 10, it saves 66%. By turn 50, it saves 93% of input tokens.
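The figures in the table are consistent with a simple cost model. This is our reading of the numbers rather than something the tutorial states: each exchange costs about 125 tokens, retrieved memories add about 50 tokens, and the memory approach keeps at most 3 exchanges in its window. A sketch under those assumptions (all names are illustrative):

```python
def naive_tokens(turn: int, per_exchange: int = 125) -> int:
    """Input tokens at a 1-indexed turn when the full history is resent."""
    return turn * per_exchange


def memory_tokens(
    turn: int,
    per_exchange: int = 125,
    memory_overhead: int = 50,
    window: int = 3,
) -> int:
    """Input tokens at a turn when only `window` exchanges plus memories are sent."""
    return memory_overhead + per_exchange * min(turn, window)


def savings_percent(turn: int) -> int:
    """Percentage of input tokens saved by the memory approach (negative = costs more)."""
    naive = naive_tokens(turn)
    return round(100 * (naive - memory_tokens(turn)) / naive)


if __name__ == "__main__":
    print(f"{'Turn':<6} {'Naive':>8} {'Memory':>8} {'Savings':>8}")
    for turn in (1, 5, 10, 20, 50):
        print(
            f"{turn:<6} {naive_tokens(turn):>8,} {memory_tokens(turn):>8,} "
            f"{savings_percent(turn):>7}%"
        )
```

The break-even point in this model falls where the naive request first exceeds the capped memory request, which happens once the conversation is longer than the window plus the memory overhead.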

Step 5: Advanced patterns

Topic filtering

If your project has multiple topics, filter search results to specific topics for more precise retrieval:

results = client.memories.search(
    query="What tech stack does the user prefer?",
    topics=["UserKnowledge"],
    user_id=user_id,
    group="default",
    retrieval_config=HybridRetrieval(limit=5),
)

for memory in results:
    print(f"- {memory.content} (topic: {memory.topic})")

Use a `ConversationSummary` topic for full history

For long conversations where you want the LLM to see every detail (not just discrete facts), let Engram maintain a single running summary of the entire conversation and replace the message history with that summary on each turn.

Enable the optional Include Conversation Summary Topic checkbox when creating your project from the Personalization template. This adds a ConversationSummary topic that's scoped by conversation_id and bounded to one memory per conversation. Each memories.add updates that memory in place.

Fetch it with the fetch retrieval type — it returns the bounded memory directly by topic and scope, without scoring by query relevance.

import uuid

conversation_id = f"session-{uuid.uuid4().hex[:8]}"

# Add messages tied to a conversation_id. If the project enabled the
# `ConversationSummary` topic, the pipeline maintains one summary memory
# per conversation_id and rewrites it in place on each add.
run = client.memories.add(
    [
        {"role": "user", "content": "I'm planning a trip to Lisbon next month."},
        {"role": "assistant", "content": "Great choice! Any specific neighborhoods?"},
        {"role": "user", "content": "I'd love to stay in Alfama for the historic vibe."},
    ],
    user_id=user_id,
    group="default",
    properties={"conversation_id": conversation_id},
)
client.runs.wait(run.run_id)

# Fetch the running summary — `fetch` returns the bounded memory directly
# by topic + scope, without ranking by query relevance. The topic must be
# enabled in the project; otherwise the search returns a "topic not found"
# error.
try:
    summary_results = client.memories.search(
        query="conversation summary",  # ignored by fetch retrieval
        user_id=user_id,
        group="default",
        topics=["ConversationSummary"],
        properties={"conversation_id": conversation_id},
        retrieval_config=FetchRetrieval(limit=1),
    )
except APIError:
    summary_results = []  # ConversationSummary topic not enabled in this project

if summary_results:
    print(f"Summary: {summary_results[0].content}")

The summary stays one memory regardless of how long the conversation runs, so the token cost of including it in your LLM call is constant — and the LLM still sees the full conversational context.

Hybrid search tuning

Adjust the retrieval_config to control how memories are ranked:

# Pure semantic search — best for conceptual similarity
VectorRetrieval(limit=5)

# Keyword search — best for exact terms
BM25Retrieval(limit=5)

# Hybrid (recommended) — combines both approaches
HybridRetrieval(limit=10)
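To build intuition for what combining both approaches means, here is one common way to merge a vector ranking and a keyword ranking: reciprocal rank fusion (RRF). This is a generic illustration and not a description of how Engram fuses results, which this tutorial does not specify. The function name, the sample memory labels, and the constant k=60 are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists into one using reciprocal rank fusion.

    Each item earns 1 / (k + rank) from every list it appears in, where rank
    is 1 for the top result. Items are returned by descending total score.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Ties are broken alphabetically so the output is deterministic
    return sorted(scores, key=lambda item: (-scores[item], item))


if __name__ == "__main__":
    vector_ranking = ["fastapi", "async", "redis"]    # hypothetical semantic hits
    keyword_ranking = ["async", "celery", "fastapi"]  # hypothetical keyword hits
    print(reciprocal_rank_fusion([vector_ranking, keyword_ranking]))
```

Items that rank near the top of both lists (here async and fastapi) end up ahead of items that appear in only one, which is the behavior a hybrid strategy is meant to provide.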

Dual-memory pattern

For the best balance of continuity and context, combine both approaches:

Recent messages (last 2-3 exchanges) — Maintains conversational flow
Engram memory search — Provides relevant historical context

This is the pattern used in Step 3. The recent messages handle references like "that" and "it", while Engram provides the long-term context that makes the assistant feel like it truly remembers.

Next steps

Memory Chat App — The foundational tutorial for integrating Engram with a chat app.
Personalized RAG — Add a knowledge base alongside per-user memory.
Store memories — Learn about all three content types (string, conversation, pre-extracted).
