Configure tokenization for keyword search

What you'll learn

In this tutorial, you'll learn how to configure tokenization in Weaviate and see how different tokenization methods impact keyword search and filtering results.

By the end of this tutorial, you'll understand:

  • How to configure tokenization for a collection property
  • How tokenization affects filter matching
  • How tokenization impacts keyword search ranking
  • How to choose the right tokenization method for your use case

Prerequisites

  • A running Weaviate instance
  • Python Weaviate client installed
  • Basic familiarity with Weaviate collections

Create a demo collection

We'll create a collection with multiple properties, each using a different tokenization method. This allows us to compare how the same text behaves under different tokenization strategies.

import weaviate
from weaviate.classes.config import Property, DataType, Tokenization, Configure

# Instantiate your client (not shown). e.g.:
# client = weaviate.connect_to_weaviate_cloud(...) or
# client = weaviate.connect_to_local()

tkn_options = [
    Tokenization.WORD,
    Tokenization.LOWERCASE,
    Tokenization.WHITESPACE,
    Tokenization.FIELD,
]

# Create a property for each tokenization option
properties = []
for tokenization in tkn_options:
    prop = Property(
        name=f"text_{tokenization.replace('.', '_')}",
        data_type=DataType.TEXT,
        tokenization=tokenization
    )
    properties.append(prop)

client.collections.create(
    name="TokenizationDemo",
    properties=properties,
    vector_config=Configure.Vectors.self_provided()
)

client.close()

Note that we do not add object vectors in this case, as we are only interested in the impact of tokenization on filters and keyword searches.

Add test data

We'll use a small, custom dataset for demonstration purposes.

collection = client.collections.use("TokenizationDemo")

phrases = [
    # string with special characters
    "Lois & Clark: The New Adventures of Superman",

    # strings with stopwords & varying orders
    "computer mouse",
    "Computer Mouse",
    "mouse computer",
    "computer mouse pad",
    "a computer mouse",

    # strings without spaces
    "variable_name",
    "Variable_Name",
    "Variable Name",
    "a_variable_name",
    "the_variable_name",
    "variable_new_name",
]

Now add the objects to the collection, storing the same text in every property so that the only difference between properties is the tokenization method.

import weaviate

# Instantiate your client (not shown). e.g.:
# client = weaviate.connect_to_weaviate_cloud(...) or
# client = weaviate.connect_to_local()

collection = client.collections.use("TokenizationDemo")

# Get property names
property_names = [p.name for p in collection.config.get().properties]

phrases = [
    # string with special characters
    "Lois & Clark: The New Adventures of Superman",

    # strings with stopwords & varying orders
    "computer mouse",
    "Computer Mouse",
    "mouse computer",
    "computer mouse pad",
    "a computer mouse",

    # strings without spaces
    "variable_name",
    "Variable_Name",
    "Variable Name",
    "a_variable_name",
    "the_variable_name",
    "variable_new_name",
]

# Insert each phrase once, writing the same text to every property
for phrase in phrases:
    obj_properties = {}
    for property_name in property_names:
        obj_properties[property_name] = phrase
    print(obj_properties)
    collection.data.insert(properties=obj_properties)

client.close()

Example 1: Punctuation and case sensitivity

Let's see how tokenization handles messy text with punctuation and mixed cases. We'll filter for various combinations of substrings from the TV show title "Lois & Clark: The New Adventures of Superman".

Set up the filter function

We'll create a reusable function to filter objects based on query strings. Remember that a filter is binary: it either matches or it doesn't.

import weaviate
from weaviate.classes.query import Filter
from weaviate.collections import Collection
from typing import List

# Instantiate your client (not shown). e.g.:
# client = weaviate.connect_to_weaviate_cloud(...) or
# client = weaviate.connect_to_local()

collection = client.collections.use("TokenizationDemo")

# Get property names
property_names = [p.name for p in collection.config.get().properties]

query_strings = ["<YOUR_QUERY_STRING>"]


def filter_demo(collection: Collection, property_names: List[str], query_strings: List[str]):
    for query_string in query_strings:
        print("\n" + "=" * 40 + f"\nHits for: '{query_string}'" + "\n" + "=" * 40)
        for property_name in property_names:
            # An `equal` filter on text matches only if every query token is found
            response = collection.query.fetch_objects(
                filters=Filter.by_property(property_name).equal(query_string),
            )
            if len(response.objects) > 0:
                print(f">> '{property_name}' matches")
                for obj in response.objects:
                    print(obj.properties[property_name])


filter_demo(collection, property_names, query_strings)

Test "Clark:" vs "clark"

filter_demo(collection, property_names, ["clark", "Clark", "clark:", "Clark:", "lois clark", "clark lois"])

The results show whether the query matched the title:

Query          word   lowercase   whitespace   field
"clark"        ✅
"Clark"        ✅
"clark:"       ✅     ✅
"Clark:"       ✅     ✅          ✅
"lois clark"   ✅
"clark lois"   ✅
Example output
========================================
Hits for: 'clark'
========================================
>> 'text_word' matches
Lois & Clark: The New Adventures of Superman

========================================
Hits for: 'Clark'
========================================
>> 'text_word' matches
Lois & Clark: The New Adventures of Superman

========================================
Hits for: 'clark:'
========================================
>> 'text_word' matches
Lois & Clark: The New Adventures of Superman
>> 'text_lowercase' matches
Lois & Clark: The New Adventures of Superman

========================================
Hits for: 'Clark:'
========================================
>> 'text_word' matches
Lois & Clark: The New Adventures of Superman
>> 'text_lowercase' matches
Lois & Clark: The New Adventures of Superman
>> 'text_whitespace' matches
Lois & Clark: The New Adventures of Superman

========================================
Hits for: 'lois clark'
========================================
>> 'text_word' matches
Lois & Clark: The New Adventures of Superman

========================================
Hits for: 'clark lois'
========================================
>> 'text_word' matches
Lois & Clark: The New Adventures of Superman

Key observations:

  • word tokenization consistently matches regardless of case or punctuation
  • lowercase preserves symbols, so the query must include the punctuation; whitespace additionally preserves case
  • Users typically don't include punctuation in queries, making word a good default (see the sketch below for how each method splits this title)
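
To build intuition for these results, here is a rough Python approximation of the four methods. This is a simplified sketch for illustration only, not Weaviate's actual tokenizer (which also interacts with stop-word handling):

import re

def approx_tokenize(text: str, method: str) -> list[str]:
    """Rough approximation of Weaviate's tokenization methods (illustrative only)."""
    if method == "word":
        # Split on anything that isn't alphanumeric, then lowercase
        return [t.lower() for t in re.split(r"[^a-zA-Z0-9]+", text) if t]
    if method == "lowercase":
        # Lowercase, then split on whitespace (symbols are kept)
        return text.lower().split()
    if method == "whitespace":
        # Split on whitespace only (case and symbols are kept)
        return text.split()
    if method == "field":
        # Trim surrounding whitespace and keep the whole value as one token
        return [text.strip()]
    raise ValueError(f"unknown method: {method}")

title = "Lois & Clark: The New Adventures of Superman"
for method in ["word", "lowercase", "whitespace", "field"]:
    print(f"{method:>10}: {approx_tokenize(title, method)}")

Under word, every variant of "clark" maps to the stored token 'clark', so all six queries can match; under lowercase the stored token is 'clark:', which is why the query needs the colon, and whitespace additionally keeps the capital "C".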

Example 2: Stop words

Here, we filter for variants of the phrase "computer mouse", where some queries include additional words like "a" or "the".

filter_demo(collection, property_names, ["computer mouse", "a computer mouse", "the computer mouse", "blue computer mouse"])

Matches for "computer mouse"

wordlowercasewhitespacefield
"computer mouse"
"a computer mouse"
"the computer mouse:"
"blue computer mouse"

Matches for "a computer mouse"

wordlowercasewhitespacefield
"computer mouse"
"a computer mouse"
"the computer mouse:"
"blue computer mouse"
Example output
========================================
Hits for: 'computer mouse'
========================================
>> 'text_word' matches
computer mouse
Computer Mouse
mouse computer
computer mouse pad
a computer mouse
>> 'text_lowercase' matches
computer mouse
Computer Mouse
mouse computer
computer mouse pad
a computer mouse
>> 'text_whitespace' matches
computer mouse
mouse computer
computer mouse pad
a computer mouse
>> 'text_field' matches
computer mouse

========================================
Hits for: 'a computer mouse'
========================================
>> 'text_word' matches
computer mouse
Computer Mouse
mouse computer
computer mouse pad
a computer mouse
>> 'text_lowercase' matches
computer mouse
Computer Mouse
mouse computer
computer mouse pad
a computer mouse
>> 'text_whitespace' matches
computer mouse
mouse computer
computer mouse pad
a computer mouse
>> 'text_field' matches
a computer mouse

========================================
Hits for: 'the computer mouse'
========================================
>> 'text_word' matches
computer mouse
Computer Mouse
mouse computer
computer mouse pad
a computer mouse
>> 'text_lowercase' matches
computer mouse
Computer Mouse
mouse computer
computer mouse pad
a computer mouse
>> 'text_whitespace' matches
computer mouse
mouse computer
computer mouse pad
a computer mouse

========================================
Hits for: 'blue computer mouse'
========================================

Key observations:

  • Stop words like "a" and "the" are ignored in word, lowercase, and whitespace tokenization
  • field tokenization treats the entire string as one token, so stop words matter
  • Adding non-stop words like "blue" prevents matches
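
The stop word list itself is set per collection through the inverted index configuration. Here is a sketch, assuming the v4 Python client's Configure.inverted_index helper; the "StopwordDemo" collection name is hypothetical:

from weaviate.classes.config import Configure, StopwordsPreset

# Hypothetical collection showing stop word configuration;
# combine with the tokenized properties from earlier as needed
client.collections.create(
    name="StopwordDemo",
    inverted_index_config=Configure.inverted_index(
        stopwords_preset=StopwordsPreset.EN,  # start from the English preset
        stopwords_additions=["pad"],          # also treat "pad" as a stop word
        stopwords_removals=["a"],             # no longer treat "a" as a stop word
    ),
)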

Example 3: Symbols and underscores

The word tokenization is a good default, but may not work for data with meaningful symbols. Let's test different variants of "variable_name".

filter_demo(collection, property_names, ["variable_name"])
Matched object        word   lowercase   whitespace   field
"variable_name"       ✅     ✅          ✅           ✅
"Variable_Name"       ✅     ✅
"Variable Name"       ✅
"a_variable_name"     ✅
"the_variable_name"   ✅
"variable_new_name"   ✅
Example output
========================================
Hits for: 'variable_name'
========================================
>> 'text_word' matches
variable_name
Variable_Name
Variable Name
a_variable_name
the_variable_name
variable_new_name
>> 'text_lowercase' matches
variable_name
Variable_Name
>> 'text_whitespace' matches
variable_name
>> 'text_field' matches
variable_name

Key observations:

  • word tokenization treats underscores as separators, which may be too permissive
  • For code, email addresses, or data where symbols are meaningful, use lowercase or whitespace
  • Consider whether "variable_new_name" should match "variable_name" in your use case

Keyword searches vs filters

Tokenization impacts keyword searches similarly to filters, but with important differences.

Set up the search function

import weaviate
from weaviate.classes.query import MetadataQuery
from weaviate.collections import Collection
from typing import List

# Instantiate your client (not shown). e.g.:
# client = weaviate.connect_to_weaviate_cloud(...) or
# client = weaviate.connect_to_local()

collection = client.collections.use("TokenizationDemo")

# Get property names
property_names = [p.name for p in collection.config.get().properties]

query_strings = ["<YOUR_QUERY_STRING>"]


def search_demo(collection: Collection, property_names: List[str], query_strings: List[str]):
    for query_string in query_strings:
        print("\n" + "=" * 40 + f"\nBM25 search results for: '{query_string}'" + "\n" + "=" * 40)
        for property_name in property_names:
            # Search one property at a time so we can compare tokenization methods
            response = collection.query.bm25(
                query=query_string,
                return_metadata=MetadataQuery(score=True),
                query_properties=[property_name],
            )
            if len(response.objects) > 0:
                print(f">> '{property_name}' search results")
                for obj in response.objects:
                    print(obj.properties[property_name], round(obj.metadata.score, 3))


search_demo(collection, property_names, query_strings)

Keyword search differences

Keyword searches use the BM25F algorithm to rank results (the scoring function is sketched below). Tokenization has two effects:

  1. Inclusion: Determines whether a result appears at all
  2. Ranking: Affects the score based on matching tokens
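
For background, this is the textbook single-field BM25 scoring function; BM25F extends it with per-field weights. The notation is standard, not Weaviate-specific:

$$
\mathrm{score}(D, Q) = \sum_{t \in Q} \mathrm{IDF}(t) \cdot \frac{f(t, D)\,(k_1 + 1)}{f(t, D) + k_1\left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}
$$

Here f(t, D) is the frequency of token t in document D, |D| is the document's token count, avgdl is the average document length, and IDF(t) up-weights rarer tokens. Because the score is a sum over matching query tokens, "lois clark" (two matching tokens) scores roughly double "clark" (one) in the table below. The free parameters k1 and b tune term-frequency saturation and length normalization; common defaults are k1 = 1.2 and b = 0.75.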

Let's revisit the "Clark" example with keyword search:

search_demo(collection, property_names, ["clark", "Clark", "clark:", "Clark:", "lois clark", "clark lois"])
Query          word    lowercase   whitespace   field
"clark"        0.613
"Clark"        0.613
"clark:"       0.613   0.48
"Clark:"       0.613   0.48        0.48
"lois clark"   1.226   0.48
"clark lois"   1.226   0.48

Key observations:

  • More matching tokens = higher scores (e.g., "lois clark" scores roughly double "clark")
  • Unlike filters, keyword search returns objects matching ANY query token, not only those matching ALL tokens
  • Scores vary with how many query tokens match and how often
Now rerun the stop word queries as keyword searches:

search_demo(collection, property_names, ["computer mouse", "a computer mouse", "the computer mouse", "blue computer mouse"])

Matches for "computer mouse"

wordlowercasewhitespacefield
"computer mouse"0.8890.8191.010.982
"Computer Mouse"0.8890.819
"a computer mouse"0.7640.7640.849
"computer mouse pad" 0.7640.7640.849

Matches for "a computer mouse"

wordlowercasewhitespacefield
"computer mouse"0.8890.8191.01
"Computer Mouse"0.8890.819
"a computer mouse"0.7641.5521.7120.982
"computer mouse pad" 0.7640.6880.849

Key observations:

  • Stop words don't prevent matches, but affect ranking
  • Scores differ for objects with/without stop words
  • lowercase and whitespace don't remove stop words from queries, giving users more control

Choosing your tokenization method

Based on what we've learned, here's guidance for choosing a tokenization method:

Use word (default) when:

  • Working with typical text data (articles, descriptions, names)
  • Users won't include exact punctuation in queries
  • Case-insensitivity is desired
  • You want forgiving search behavior

Use lowercase when:

  • Symbols like &, @, _, - are meaningful
  • Working with code snippets, email addresses, or technical notation
  • You want case-insensitivity but need to preserve symbols

Use whitespace when:

  • Case sensitivity is important (entity names, acronyms)
  • Symbols are meaningful
  • You can handle case-sensitivity in your query construction

Use field when:

  • Exact matches are required
  • Working with unique identifiers (URLs, IDs, exact email addresses)
  • You'll use wildcard filters for partial matches
  • Note: wildcard filters can be slow; use them judiciously (see the sketch below)
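
As an illustration of the field case, here is a sketch of an exact match and a wildcard filter on an identifier-style property. The "Articles" collection and its "url" property are hypothetical; Filter.like performs wildcard matching in the v4 Python client:

from weaviate.classes.config import Property, DataType, Tokenization
from weaviate.classes.query import Filter

# A property for exact-match identifiers (hypothetical collection)
client.collections.create(
    name="Articles",
    properties=[
        Property(name="url", data_type=DataType.TEXT, tokenization=Tokenization.FIELD),
    ],
)

articles = client.collections.use("Articles")

# Exact match: with field tokenization, the whole value is one token
exact = articles.query.fetch_objects(
    filters=Filter.by_property("url").equal("https://example.com/posts/42"),
)

# Partial match via a wildcard filter; convenient, but can be slow at scale
partial = articles.query.fetch_objects(
    filters=Filter.by_property("url").like("*example.com*"),
)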

Hybrid searches

A hybrid search combines keyword search and vector search results. Tokenization only impacts the keyword search portion; the vector search part uses the model's built-in tokenization.
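
As a minimal sketch, assuming a collection that has a vectorizer configured (our demo collection stores no vectors, so only the keyword half would contribute there):

# "ArticlesWithVectors" is a hypothetical collection with a configured vectorizer
articles = client.collections.use("ArticlesWithVectors")

response = articles.query.hybrid(
    query="computer mouse",  # used for both the BM25 and the vector search
    alpha=0.5,               # 0 = pure keyword (BM25), 1 = pure vector
    limit=5,
)

for obj in response.objects:
    print(obj.properties)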

Summary

You've learned how to:

  • Configure different tokenization methods for collection properties
  • Test and compare tokenization behavior with filters and searches
  • Understand the trade-offs between different tokenization methods
  • Choose the appropriate tokenization method for your use case

The key takeaway: tokenization is a core part of your search strategy. Start with word as a sensible default, but adjust based on your data characteristics and user expectations.

Questions and feedback

If you have any questions or feedback, let us know in the user forum.