Personalized RAG with Per-User Memory
Standard RAG retrieves the same documents for every user. But a Python developer and a JavaScript developer asking "How do I use the API?" should get different answers — tailored to their language, experience level, and preferences.
This tutorial combines a Weaviate knowledge base (shared product documentation) with Engram per-user memory (individual preferences and context) to build a personalized, multi-tenant RAG assistant.
You'll learn how to:
- Search both a knowledge base and user memory in parallel
- Build prompts that merge shared knowledge with personal context
- Isolate memory between users with user_id scoping
- Handle concurrent users with AsyncEngramClient
- Manage user data for privacy compliance
Prerequisites
- An Engram project with an API key (Quickstart)
- A Weaviate Cloud instance
- An Anthropic or OpenAI API key
- Python packages:
pip install weaviate-engram weaviate-client anthropic openai
Step 1: Set up both clients
Initialize the Engram client for user memory and the Weaviate client for your knowledge base.
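import os

import weaviate
from engram import EngramClient  # assumed import path for the weaviate-engram package; confirm in the Quickstart

# Engram client for per-user memory
engram = EngramClient(
    api_key=os.environ["ENGRAM_API_KEY"]
)

# Weaviate client for the shared knowledge base
weaviate_client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.environ["WEAVIATE_URL"],
    auth_credentials=weaviate.auth.AuthApiKey(os.environ["WEAVIATE_API_KEY"]),
)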
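Both clients read their credentials from environment variables, so a missing variable surfaces as a bare KeyError partway through setup. The following is a minimal sketch of an up-front check. The helper name require_env is hypothetical and not part of either SDK.

```python
import os


def require_env(names, environ=None):
    """Return {name: value} for the given variables, or raise listing every missing one."""
    # environ defaults to the real process environment; a dict can be passed for testing.
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in names}
```

Calling require_env(["ENGRAM_API_KEY", "WEAVIATE_URL", "WEAVIATE_API_KEY"]) before creating the clients reports all missing settings in one message instead of one at a time.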
Step 2: Populate the knowledge base
Create a Weaviate collection with sample product documentation. This represents your shared knowledge base — the same docs are available to all users.
def populate_knowledge_base(weaviate_client):
    """Create a Weaviate collection and insert sample product documentation."""
    collection = weaviate_client.collections.create(
        name="ProductDocs",
        description="Product documentation for the Acme platform",
    )
    docs = [
        "Acme API supports REST and GraphQL endpoints for data access.",
        "The Acme dashboard provides real-time analytics and custom reports.",
        "Acme's Python SDK supports async operations with asyncio.",
        "Acme pricing: Free tier up to 1K requests/day, Pro at $49/month for 100K requests/day.",
        "Acme supports SSO with SAML and OIDC for enterprise customers.",
    ]
    # Insert each document as an object with a single "content" property
    with collection.batch.dynamic() as batch:
        for doc in docs:
            batch.add_object(properties={"content": doc})
    return collection
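Creating a collection that already exists typically fails, which is what happens the second time this step runs. The following is a hedged sketch of an idempotent variant built on the v4 client's collections.exists and collections.get calls. The helper name get_or_create_collection is illustrative only.

```python
def get_or_create_collection(weaviate_client, name, description=""):
    """Return (collection, created) without failing when the collection already exists."""
    if weaviate_client.collections.exists(name):
        # Reuse the existing collection instead of trying to create it again.
        return weaviate_client.collections.get(name), False
    collection = weaviate_client.collections.create(name=name, description=description)
    return collection, True
```

The created flag lets the caller insert the sample documents only on the first run, which avoids duplicating them in the knowledge base.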
Step 3: Store user context in Engram
Store different preferences and facts for different users. Engram extracts discrete facts and scopes them to each user_id.
# Suffix that keeps the tutorial's user IDs unique; any string works (illustrative value)
_test_suffix = "demo"

# Store preferences for Alice (Python developer)
run_a = engram.memories.add(
    "I'm a Python developer. I prefer concise code examples. I'm building a FastAPI microservice.",
    user_id=f"tutorial-rag-alice-{_test_suffix}",
    group="default",
)
# Wait for the run to finish before reading its results
status_a = engram.runs.wait(run_a.run_id)
print(f"Alice: {status_a.status}, {len(status_a.memories_created)} memories")

# Store preferences for Bob (JavaScript developer)
run_b = engram.memories.add(
    "I'm a JavaScript developer. I prefer detailed explanations with context. I'm building a React dashboard.",
    user_id=f"tutorial-rag-bob-{_test_suffix}",
    group="default",
)
status_b = engram.runs.wait(run_b.run_id)
print(f"Bob: {status_b.status}, {len(status_b.memories_created)} memories")
Alice and Bob now have separate memory stores. Alice's memories include facts like "Python developer" and "prefers concise examples", while Bob's include "JavaScript developer" and "prefers detailed explanations".
Step 4: Build the dual-search function
Create a function that searches both the Weaviate knowledge base and Engram user memory. The knowledge base provides the factual content; Engram provides the personalization context.
def dual_search(query, user_id, kb_results=None):
    """Search both a knowledge base and Engram user memory."""
    # Search Engram for user-specific memories
    user_memories = engram.memories.search(
        query=query,
        user_id=user_id,
        group="default",
        retrieval_config=HybridRetrieval(limit=5),
    )
    return {
        "knowledge_base": kb_results or [],
        "user_memories": user_memories,
    }
In production, you'd also search Weaviate here:
def dual_search(query, user_id, weaviate_collection):
    # Search knowledge base
    kb_results = weaviate_collection.query.hybrid(query=query, limit=5)
    kb_docs = [obj.properties["content"] for obj in kb_results.objects]
    # Search user memory
    user_memories = engram.memories.search(
        query=query, user_id=user_id, group="default",
        retrieval_config=HybridRetrieval(limit=5),
    )
    return {"knowledge_base": kb_docs, "user_memories": user_memories}
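The knowledge base lookup and the memory lookup are independent, so they can run at the same time rather than one after the other. Here is a minimal sketch using a thread pool from the standard library. The names search_kb and search_memory are hypothetical callables that you would wrap around the Weaviate and Engram calls yourself; they are not SDK functions.

```python
from concurrent.futures import ThreadPoolExecutor


def dual_search_parallel(query, search_kb, search_memory):
    """Run the knowledge-base and user-memory searches concurrently.

    search_kb and search_memory are callables that accept the query string
    and return a list of results (assumed wrappers, not SDK functions).
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        kb_future = pool.submit(search_kb, query)
        memory_future = pool.submit(search_memory, query)
        # result() blocks until the call finishes and re-raises any exception from it.
        return {
            "knowledge_base": kb_future.result(),
            "user_memories": memory_future.result(),
        }
```

With two blocking network calls, total latency is roughly that of the slower call instead of the sum of both. Threads suit the synchronous clients; with async clients, asyncio.gather plays the same role.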
Step 5: Construct a personalized prompt
Merge the knowledge base results and user memories into a single prompt. The LLM uses the shared docs for accuracy and the user context for personalization.
def build_prompt_anthropic(query, kb_docs, user_memories, anthropic_client):
    """Build a personalized prompt combining KB docs and user memory."""
    kb_context = "\n".join(f"- {doc}" for doc in kb_docs)
    memory_context = "\n".join(f"- {m.content}" for m in user_memories)
    system_prompt = f"""You are a helpful product assistant.
Product documentation:
{kb_context}
What you know about this user:
{memory_context}
Tailor your response to the user's background and preferences."""
    response = anthropic_client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=1024,
        system=system_prompt,
        messages=[{"role": "user", "content": query}],
    )
    return response.content[0].text
With this prompt structure, the same question produces different responses:
- Alice gets a concise Python example using the SDK's async features
- Bob gets a detailed explanation with JavaScript/REST examples
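Prompt assembly can be separated from the model call, which makes it easy to test and to reuse with either provider. The sketch below assumes the memories have already been reduced to plain strings, for example with [m.content for m in user_memories]. The function name and the fallback wording are illustrative.

```python
def build_system_prompt(kb_docs, memory_texts):
    """Assemble the system prompt from KB snippets and user memory strings."""

    def as_bullets(items, fallback):
        # Emit a placeholder bullet instead of an empty section when a list is empty.
        return "\n".join(f"- {item}" for item in items) if items else f"- {fallback}"

    return (
        "You are a helpful product assistant.\n\n"
        "Product documentation:\n"
        f"{as_bullets(kb_docs, 'No matching documentation found.')}\n\n"
        "What you know about this user:\n"
        f"{as_bullets(memory_texts, 'Nothing yet.')}\n\n"
        "Tailor your response to the user's background and preferences."
    )
```

A new user with no stored memories then still gets a well-formed prompt, and the returned string can be passed as the system prompt to whichever LLM client you use.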
Step 6: Demo with two users
Both users ask the same question, but each gets a personalized response based on their stored context.
query = "How do I access the API?"

# Alice's personalized search
alice_memories = engram.memories.search(
    query=query,
    user_id=f"tutorial-rag-alice-{_test_suffix}",
    group="default",
    retrieval_config=HybridRetrieval(limit=5),
)
print("Alice's context:")
for m in alice_memories:
    print(f" - {m.content}")

# Bob's personalized search
bob_memories = engram.memories.search(
    query=query,
    user_id=f"tutorial-rag-bob-{_test_suffix}",
    group="default",
    retrieval_config=HybridRetrieval(limit=5),
)
print("\nBob's context:")
for m in bob_memories:
    print(f" - {m.content}")
Alice's search returns memories about Python and FastAPI. Bob's returns memories about JavaScript and React. The LLM uses this context to tailor its answer.
Step 7: Add user isolation
Engram's user_id scoping ensures strict memory isolation. User A's memories are never returned when searching as User B, even if the query matches.
# Alice searches for Bob's topics; none of Bob's memories should come back
alice_cross_search = engram.memories.search(
    query="React dashboard JavaScript",
    user_id=f"tutorial-rag-alice-{_test_suffix}",
    group="default",
)
print(f"Alice searching for Bob's topics: {len(alice_cross_search)} results")

# Bob searches for Alice's topics; none of Alice's memories should come back
bob_cross_search = engram.memories.search(
    query="FastAPI Python microservice",
    user_id=f"tutorial-rag-bob-{_test_suffix}",
    group="default",
)
print(f"Bob searching for Alice's topics: {len(bob_cross_search)} results")
Alice's search for "React dashboard JavaScript" returns none of Bob's memories, because they are stored under Bob's user_id. The same holds in reverse. This isolation is enforced at the storage layer, not just filtered in application code.
User isolation is automatic when you use user_id with user-scoped topics. You don't need additional access control logic — Engram handles it.
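One way to check isolation in a test suite is to compare the memory IDs that each user can see and confirm the two sets never overlap. This small sketch assumes only that each result exposes an id attribute, as the memories in this tutorial do. The function name is hypothetical.

```python
def shared_memory_ids(results_a, results_b):
    """Return the IDs that appear in both users' search results (expected to be empty)."""
    ids_a = {memory.id for memory in results_a}
    ids_b = {memory.id for memory in results_b}
    return ids_a & ids_b
```

For example, shared_memory_ids(alice_cross_search, bob_memories) should return an empty set: whatever Alice's cross-search returned, none of it should be one of Bob's memories.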
Step 8: Scale with async
For production applications handling multiple concurrent users, switch to AsyncEngramClient. It uses the same API but supports async/await for non-blocking operations.
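from engram import AsyncEngramClient  # assumed import path, mirroring EngramClient; confirm in the SDK docs

async_client = AsyncEngramClient(
    api_key=os.environ["ENGRAM_API_KEY"]
)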
Use asyncio.gather() to handle multiple user requests concurrently:
import asyncio


async def search_for_user(async_engram, user_id, query):
    """Search memories for a single user."""
    results = await async_engram.memories.search(
        query=query,
        user_id=user_id,
        group="default",
        retrieval_config=HybridRetrieval(limit=5),
    )
    return user_id, results


async def handle_concurrent_users():
    """Handle multiple users concurrently."""
    tasks = [
        search_for_user(async_client, f"tutorial-rag-alice-{_test_suffix}", "programming preferences"),
        search_for_user(async_client, f"tutorial-rag-bob-{_test_suffix}", "programming preferences"),
    ]
    # gather() runs the coroutines concurrently and returns results in task order
    results = await asyncio.gather(*tasks)
    for uid, memories in results:
        print(f"\n{uid}:")
        for m in memories:
            print(f" - {m.content}")
    return results


concurrent_results = asyncio.run(handle_concurrent_users())
Both searches run in parallel, reducing total latency compared to sequential calls.
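With many users, launching every search at once can run into rate limits. The following hedged sketch caps the number of in-flight requests with asyncio.Semaphore and records failures per user instead of raising on the first one. Here search_one is a hypothetical coroutine function taking (user_id, query), and the limit of 10 is an arbitrary default.

```python
import asyncio


async def search_many(search_one, user_ids, query, max_concurrency=10):
    """Run search_one(user_id, query) for every user, at most max_concurrency at a time.

    Returns a dict mapping each user_id to its results or to the exception raised.
    """
    user_ids = list(user_ids)  # materialize so the IDs can be iterated twice
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(user_id):
        # Only max_concurrency coroutines can be inside this block at once.
        async with semaphore:
            return await search_one(user_id, query)

    outcomes = await asyncio.gather(
        *(guarded(user_id) for user_id in user_ids),
        return_exceptions=True,  # return errors as values so one failure does not hide other results
    )
    return dict(zip(user_ids, outcomes))
```

Something like functools.partial(search_for_user, async_client) can serve as search_one. Callers should check each value with isinstance(value, Exception) before using it.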
Step 9: User data management
For privacy compliance (e.g. GDPR right to deletion), you can retrieve and delete individual memories or all memories for a user.
# Find Alice's memories, then retrieve a specific one by ID
alice_memories = engram.memories.search(
    query="Python developer",
    user_id=f"tutorial-rag-alice-{_test_suffix}",
    group="default",
)
if alice_memories:
    memory = engram.memories.get(
        alice_memories[0].id,
        user_id=f"tutorial-rag-alice-{_test_suffix}",
        group="default",
    )
    print(f"Retrieved: {memory.content}")

# Delete every memory the search returned (e.g. GDPR right-to-deletion)
for m in alice_memories:
    engram.memories.delete(m.id, user_id=f"tutorial-rag-alice-{_test_suffix}", group="default")
    print(f"Deleted: {m.id}")

# Verify deletion
remaining = engram.memories.search(
    query="Python developer",
    user_id=f"tutorial-rag-alice-{_test_suffix}",
    group="default",
)
print(f"Remaining memories for Alice: {len(remaining)}")
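Use memories.search() to find a user's memories, then call memories.delete() on each one. A search returns only the matches for its query, so repeat the search until it comes back empty before treating the deletion as complete. Once every memory is deleted, subsequent searches return no results for that user.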
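A single search-and-delete pass removes only what that one search returned, so some memories may be left behind. The sketch below repeats the pass until every query comes back empty. Here search and delete are hypothetical callables wrapping engram.memories.search and engram.memories.delete for one user, and max_rounds is a safety cap chosen for illustration.

```python
def delete_all_memories(search, delete, queries, max_rounds=20):
    """Repeatedly search and delete until no query returns results.

    search(query) returns a list of objects with an .id attribute;
    delete(memory_id) removes one memory. Returns the list of deleted IDs.
    """
    deleted = []
    for _ in range(max_rounds):
        # Collect the distinct IDs surfaced by all queries in this round.
        found = {memory.id for query in queries for memory in search(query)}
        if not found:
            return deleted  # an empty round confirms nothing is left to surface
        for memory_id in sorted(found):
            delete(memory_id)
            deleted.append(memory_id)
    raise RuntimeError(f"Gave up after {max_rounds} rounds without an empty search")
```

Broad queries make it more likely that every memory surfaces. If the Manage memories reference documents a list or bulk-delete operation, that is likely a better fit than this loop.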
Next steps
- Memory Chat App — The foundational tutorial for integrating Engram with a chat app.
- Context Window Management — Reduce token costs by replacing conversation history with memory.
- Manage memories — API reference for get and delete operations.
- Core concepts — Learn about topics, groups, scoping, and pipelines.
Questions and feedback
If you have any questions or feedback, let us know in the user forum.
