Configure tokenization for keyword search

What you'll learn

In this tutorial, you'll learn how to configure tokenization in Weaviate and see how different tokenization methods impact keyword search and filtering results.

By the end of this tutorial, you'll understand:

  • How to configure tokenization for a collection property
  • How tokenization affects filter matching
  • How tokenization impacts keyword search ranking
  • How to choose the right tokenization method for your use case

Prerequisites

  • A running Weaviate instance
  • Python Weaviate client installed
  • Basic familiarity with Weaviate collections

Create a demo collection

We'll create a collection with multiple properties, each using a different tokenization method. This allows us to compare how the same text behaves under different tokenization strategies.

import weaviate
from weaviate.classes.config import Property, DataType, Tokenization, Configure

# Instantiate your client (not shown). e.g.:
# client = weaviate.connect_to_weaviate_cloud(...) or
# client = weaviate.connect_to_local()

tkn_options = [
    Tokenization.WORD,
    Tokenization.LOWERCASE,
    Tokenization.WHITESPACE,
    Tokenization.FIELD,
]

# Create a property for each tokenization option
properties = []
for tokenization in tkn_options:
    prop = Property(
        name=f"text_{tokenization.replace('.', '_')}",
        data_type=DataType.TEXT,
        tokenization=tokenization
    )
    properties.append(prop)

client.collections.create(
    name="TokenizationDemo",
    properties=properties,
    vector_config=Configure.Vectors.self_provided()
)

client.close()

Note that we do not add object vectors in this case, as we are only interested in the impact of tokenization on filters and keyword searches.

Add test data

We'll use a small, custom dataset for demonstration purposes.

collection = client.collections.use("TokenizationDemo")

phrases = [
    # string with special characters
    "Lois & Clark: The New Adventures of Superman",

    # strings with stopwords & varying orders
    "computer mouse",
    "Computer Mouse",
    "mouse computer",
    "computer mouse pad",
    "a computer mouse",

    # strings without spaces
    "variable_name",
    "Variable_Name",
    "Variable Name",
    "a_variable_name",
    "the_variable_name",
    "variable_new_name",
]

Now add the objects to the collection, storing the same text in every property so that the only difference between properties is the tokenization method.

import weaviate

# Instantiate your client (not shown). e.g.:
# client = weaviate.connect_to_weaviate_cloud(...) or
# client = weaviate.connect_to_local()

collection = client.collections.use("TokenizationDemo")

# Get property names
property_names = [p.name for p in collection.config.get().properties]

phrases = [
    # string with special characters
    "Lois & Clark: The New Adventures of Superman",

    # strings with stopwords & varying orders
    "computer mouse",
    "Computer Mouse",
    "mouse computer",
    "computer mouse pad",
    "a computer mouse",

    # strings without spaces
    "variable_name",
    "Variable_Name",
    "Variable Name",
    "a_variable_name",
    "the_variable_name",
    "variable_new_name",
]

# Insert each phrase once, writing the same text to every property
for phrase in phrases:
    obj_properties = {}
    for property_name in property_names:
        obj_properties[property_name] = phrase
    print(obj_properties)
    collection.data.insert(properties=obj_properties)

client.close()

Example 1: Punctuation and case sensitivity

Let's see how tokenization handles messy text with punctuation and mixed cases. We'll filter for various combinations of substrings from the TV show title "Lois & Clark: The New Adventures of Superman".

Set up the filter function

We'll create a reusable function to filter objects based on query strings. Remember that a filter is binary: it either matches or it doesn't.

import weaviate
from weaviate.classes.query import Filter
from weaviate.collections import Collection
from typing import List

# Instantiate your client (not shown). e.g.:
# client = weaviate.connect_to_weaviate_cloud(...) or
# client = weaviate.connect_to_local()

collection = client.collections.use("TokenizationDemo")

# Get property names
property_names = [p.name for p in collection.config.get().properties]

query_strings = ["<YOUR_QUERY_STRING>"]


def filter_demo(collection: Collection, property_names: List[str], query_strings: List[str]):
    for query_string in query_strings:
        print("\n" + "=" * 40 + f"\nHits for: '{query_string}'" + "\n" + "=" * 40)
        for property_name in property_names:
            # An `equal` filter on text matches only if every query token is found
            response = collection.query.fetch_objects(
                filters=Filter.by_property(property_name).equal(query_string),
            )
            if len(response.objects) > 0:
                print(f">> '{property_name}' matches")
                for obj in response.objects:
                    print(obj.properties[property_name])


filter_demo(collection, property_names, query_strings)

Test "Clark:" vs "clark"

filter_demo(collection, property_names, ["clark", "Clark", "clark:", "Clark:", "lois clark", "clark lois"])

The results show whether the query matched the title:

Query          word   lowercase   whitespace   field
"clark"        ✅
"Clark"        ✅
"clark:"       ✅     ✅
"Clark:"       ✅     ✅          ✅
"lois clark"   ✅
"clark lois"   ✅
Example output
========================================
Hits for: 'clark'
========================================
>> 'text_word' matches
Lois & Clark: The New Adventures of Superman

========================================
Hits for: 'Clark'
========================================
>> 'text_word' matches
Lois & Clark: The New Adventures of Superman

========================================
Hits for: 'clark:'
========================================
>> 'text_word' matches
Lois & Clark: The New Adventures of Superman
>> 'text_lowercase' matches
Lois & Clark: The New Adventures of Superman

========================================
Hits for: 'Clark:'
========================================
>> 'text_word' matches
Lois & Clark: The New Adventures of Superman
>> 'text_lowercase' matches
Lois & Clark: The New Adventures of Superman
>> 'text_whitespace' matches
Lois & Clark: The New Adventures of Superman

========================================
Hits for: 'lois clark'
========================================
>> 'text_word' matches
Lois & Clark: The New Adventures of Superman

========================================
Hits for: 'clark lois'
========================================
>> 'text_word' matches
Lois & Clark: The New Adventures of Superman

Key observations:

  • word tokenization consistently matches regardless of case or punctuation
  • lowercase preserves symbols, so the query must include the punctuation; whitespace additionally preserves case
  • Users typically don't include punctuation in queries, making word a good default (see the sketch below for how each method splits this title)
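
To build intuition for these results, here is a rough Python approximation of the four methods. This is a simplified sketch for illustration only, not Weaviate's actual tokenizer (which also interacts with stop-word handling):

import re

def approx_tokenize(text: str, method: str) -> list[str]:
    """Rough approximation of Weaviate's tokenization methods (illustrative only)."""
    if method == "word":
        # Split on anything that isn't alphanumeric, then lowercase
        return [t.lower() for t in re.split(r"[^a-zA-Z0-9]+", text) if t]
    if method == "lowercase":
        # Lowercase, then split on whitespace (symbols are kept)
        return text.lower().split()
    if method == "whitespace":
        # Split on whitespace only (case and symbols are kept)
        return text.split()
    if method == "field":
        # Trim surrounding whitespace and keep the whole value as one token
        return [text.strip()]
    raise ValueError(f"unknown method: {method}")

title = "Lois & Clark: The New Adventures of Superman"
for method in ["word", "lowercase", "whitespace", "field"]:
    print(f"{method:>10}: {approx_tokenize(title, method)}")

Under word, every variant of "clark" maps to the stored token 'clark', so all six queries can match; under lowercase the stored token is 'clark:', which is why the query needs the colon, and whitespace additionally keeps the capital "C".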

Example 2: Stop words

Here, we filter for variants of the phrase "computer mouse", where some queries include additional words like "a" or "the".

filter_demo(collection, property_names, ["computer mouse", "a computer mouse", "the computer mouse", "blue computer mouse"])

Matches for "computer mouse"

wordlowercasewhitespacefield
"computer mouse"
"a computer mouse"
"the computer mouse:"
"blue computer mouse"

Matches for "a computer mouse"

wordlowercasewhitespacefield
"computer mouse"
"a computer mouse"
"the computer mouse:"
"blue computer mouse"
Example output
========================================
Hits for: 'computer mouse'
========================================
>> 'text_word' matches
computer mouse
Computer Mouse
mouse computer
computer mouse pad
a computer mouse
>> 'text_lowercase' matches
computer mouse
Computer Mouse
mouse computer
computer mouse pad
a computer mouse
>> 'text_whitespace' matches
computer mouse
mouse computer
computer mouse pad
a computer mouse
>> 'text_field' matches
computer mouse

========================================
Hits for: 'a computer mouse'
========================================
>> 'text_word' matches
computer mouse
Computer Mouse
mouse computer
computer mouse pad
a computer mouse
>> 'text_lowercase' matches
computer mouse
Computer Mouse
mouse computer
computer mouse pad
a computer mouse
>> 'text_whitespace' matches
computer mouse
mouse computer
computer mouse pad
a computer mouse
>> 'text_field' matches
a computer mouse

========================================
Hits for: 'the computer mouse'
========================================
>> 'text_word' matches
computer mouse
Computer Mouse
mouse computer
computer mouse pad
a computer mouse
>> 'text_lowercase' matches
computer mouse
Computer Mouse
mouse computer
computer mouse pad
a computer mouse
>> 'text_whitespace' matches
computer mouse
mouse computer
computer mouse pad
a computer mouse

========================================
Hits for: 'blue computer mouse'
========================================

Key observations:

  • Stop words like "a" and "the" are ignored in word, lowercase, and whitespace tokenization
  • field tokenization treats the entire string as one token, so stop words matter
  • Adding non-stop words like "blue" prevents matches
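
The stop word list itself is set per collection through the inverted index configuration. Here is a sketch, assuming the v4 Python client's Configure.inverted_index helper; the "StopwordDemo" collection name is hypothetical:

from weaviate.classes.config import Configure, StopwordsPreset

# Hypothetical collection showing stop word configuration;
# combine with the tokenized properties from earlier as needed
client.collections.create(
    name="StopwordDemo",
    inverted_index_config=Configure.inverted_index(
        stopwords_preset=StopwordsPreset.EN,  # start from the English preset
        stopwords_additions=["pad"],          # also treat "pad" as a stop word
        stopwords_removals=["a"],             # no longer treat "a" as a stop word
    ),
)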

Example 3: Symbols and underscores

The word tokenization is a good default, but may not work for data with meaningful symbols. Let's test different variants of "variable_name".

filter_demo(collection, property_names, ["variable_name"])
Matched object        word   lowercase   whitespace   field
"variable_name"       ✅     ✅          ✅           ✅
"Variable_Name"       ✅     ✅
"Variable Name"       ✅
"a_variable_name"     ✅
"the_variable_name"   ✅
"variable_new_name"   ✅
Example output
========================================
Hits for: 'variable_name'
========================================
>> 'text_word' matches
variable_name
Variable_Name
Variable Name
a_variable_name
the_variable_name
variable_new_name
>> 'text_lowercase' matches
variable_name
Variable_Name
>> 'text_whitespace' matches
variable_name
>> 'text_field' matches
variable_name

Key observations:

  • word tokenization treats underscores as separators, which may be too permissive
  • For code, email addresses, or data where symbols are meaningful, use lowercase or whitespace
  • Consider whether "variable_new_name" should match "variable_name" in your use case

Keyword searches vs filters

Tokenization impacts keyword searches similarly to filters, but with important differences.

Set up the search function

import weaviate
from weaviate.classes.query import MetadataQuery
from weaviate.collections import Collection
from typing import List

# Instantiate your client (not shown). e.g.:
# client = weaviate.connect_to_weaviate_cloud(...) or
# client = weaviate.connect_to_local()

collection = client.collections.use("TokenizationDemo")

# Get property names
property_names = [p.name for p in collection.config.get().properties]

query_strings = ["<YOUR_QUERY_STRING>"]


def search_demo(collection: Collection, property_names: List[str], query_strings: List[str]):
    for query_string in query_strings:
        print("\n" + "=" * 40 + f"\nBM25 search results for: '{query_string}'" + "\n" + "=" * 40)
        for property_name in property_names:
            # Search one property at a time so we can compare tokenization methods
            response = collection.query.bm25(
                query=query_string,
                return_metadata=MetadataQuery(score=True),
                query_properties=[property_name],
            )
            if len(response.objects) > 0:
                print(f">> '{property_name}' search results")
                for obj in response.objects:
                    print(obj.properties[property_name], round(obj.metadata.score, 3))


search_demo(collection, property_names, query_strings)

Keyword search differences

Keyword searches use the BM25F algorithm to rank results (the scoring function is sketched below). Tokenization has two effects:

  1. Inclusion: Determines whether a result appears at all
  2. Ranking: Affects the score based on matching tokens
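
For background, this is the textbook single-field BM25 scoring function; BM25F extends it with per-field weights. The notation is standard, not Weaviate-specific:

$$
\mathrm{score}(D, Q) = \sum_{t \in Q} \mathrm{IDF}(t) \cdot \frac{f(t, D)\,(k_1 + 1)}{f(t, D) + k_1\left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}
$$

Here f(t, D) is the frequency of token t in document D, |D| is the document's token count, avgdl is the average document length, and IDF(t) up-weights rarer tokens. Because the score is a sum over matching query tokens, "lois clark" (two matching tokens) scores roughly double "clark" (one) in the table below. The free parameters k1 and b tune term-frequency saturation and length normalization; common defaults are k1 = 1.2 and b = 0.75.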

Let's revisit the "Clark" example with keyword search:

search_demo(collection, property_names, ["clark", "Clark", "clark:", "Clark:", "lois clark", "clark lois"])
Query          word    lowercase   whitespace   field
"clark"        0.613
"Clark"        0.613
"clark:"       0.613   0.48
"Clark:"       0.613   0.48        0.48
"lois clark"   1.226   0.48
"clark lois"   1.226   0.48

Key observations:

  • More matching tokens = higher scores (e.g., "lois clark" scores roughly double "clark")
  • Unlike filters, keyword search returns objects matching ANY query token, not only those matching ALL tokens
  • Scores vary with how many query tokens match and how often
Now rerun the stop word queries as keyword searches:

search_demo(collection, property_names, ["computer mouse", "a computer mouse", "the computer mouse", "blue computer mouse"])

Matches for "computer mouse"

wordlowercasewhitespacefield
"computer mouse"0.8890.8191.010.982
"Computer Mouse"0.8890.819
"a computer mouse"0.7640.7640.849
"computer mouse pad" 0.7640.7640.849

Matches for "a computer mouse"

wordlowercasewhitespacefield
"computer mouse"0.8890.8191.01
"Computer Mouse"0.8890.819
"a computer mouse"0.7641.5521.7120.982
"computer mouse pad" 0.7640.6880.849

Key observations:

  • Stop words don't prevent matches, but affect ranking
  • Scores differ for objects with/without stop words
  • lowercase and whitespace don't remove stop words from queries, giving users more control

Choosing your tokenization method

Based on what we've learned, here's guidance for choosing a tokenization method:

Use word (default) when:

  • Working with typical text data (articles, descriptions, names)
  • Users won't include exact punctuation in queries
  • Case-insensitivity is desired
  • You want forgiving search behavior

Use lowercase when:

  • Symbols like &, @, _, - are meaningful
  • Working with code snippets, email addresses, or technical notation
  • You want case-insensitivity but need to preserve symbols

Use whitespace when:

  • Case sensitivity is important (entity names, acronyms)
  • Symbols are meaningful
  • You can handle case-sensitivity in your query construction

Use field when:

  • Exact matches are required
  • Working with unique identifiers (URLs, IDs, exact email addresses)
  • You'll use wildcard filters for partial matches
  • Note: wildcard filters can be slow; use them judiciously (see the sketch below)
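
As an illustration of the field case, here is a sketch of an exact match and a wildcard filter on an identifier-style property. The "Articles" collection and its "url" property are hypothetical; Filter.like performs wildcard matching in the v4 Python client:

from weaviate.classes.config import Property, DataType, Tokenization
from weaviate.classes.query import Filter

# A property for exact-match identifiers (hypothetical collection)
client.collections.create(
    name="Articles",
    properties=[
        Property(name="url", data_type=DataType.TEXT, tokenization=Tokenization.FIELD),
    ],
)

articles = client.collections.use("Articles")

# Exact match: with field tokenization, the whole value is one token
exact = articles.query.fetch_objects(
    filters=Filter.by_property("url").equal("https://example.com/posts/42"),
)

# Partial match via a wildcard filter; convenient, but can be slow at scale
partial = articles.query.fetch_objects(
    filters=Filter.by_property("url").like("*example.com*"),
)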

Hybrid searches

A hybrid search combines keyword search and vector search results. Tokenization only impacts the keyword search portion; the vector search part uses the model's built-in tokenization.
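
As a minimal sketch, assuming a collection that has a vectorizer configured (our demo collection stores no vectors, so only the keyword half would contribute there):

# "ArticlesWithVectors" is a hypothetical collection with a configured vectorizer
articles = client.collections.use("ArticlesWithVectors")

response = articles.query.hybrid(
    query="computer mouse",  # used for both the BM25 and the vector search
    alpha=0.5,               # 0 = pure keyword (BM25), 1 = pure vector
    limit=5,
)

for obj in response.objects:
    print(obj.properties)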

Summary

You've learned how to:

  • Configure different tokenization methods for collection properties
  • Test and compare tokenization behavior with filters and searches
  • Understand the trade-offs between different tokenization methods
  • Choose the appropriate tokenization method for your use case

The key takeaway: tokenization is a core part of your search strategy. Start with word as a sensible default, but adjust based on your data characteristics and user expectations.

Questions and feedback

If you have any questions or feedback, let us know in the user forum.