Keyword search
Keyword search, also called "BM25 (Best match 25)" or "sparse vector" search, returns objects that have the highest BM25F scores.
The Query Agent translates plain English questions into optimized Weaviate queries automatically - no manual query construction needed.
Basic BM25 search
To use BM25 keyword search, define a search string.
If a snippet doesn't work or you have feedback, please open a GitHub issue.
jeopardy = client.collections.use("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="food",
    limit=3
)

for o in response.objects:
    print(o.properties)
Example response
The response is like this:
{
"data": {
"Get": {
"JeopardyQuestion": [
{
"answer": "food stores (supermarkets)",
"question": "This type of retail store sells more shampoo & makeup than any other"
},
{
"answer": "cake",
"question": "Devil's food & angel food are types of this dessert"
},
{
"answer": "a closer grocer",
"question": "A nearer food merchant"
}
]
}
}
}
Search operators
Added in v1.31. Search operators define the minimum number of query tokens that an object must contain to be returned. The options are "and" and "or" (the default).
or
With the or operator, the search returns objects that contain at least minimumOrTokensMatch of the tokens in the search string.
from weaviate.classes.query import BM25Operator

jeopardy = client.collections.use("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="Australian mammal cute",
    operator=BM25Operator.or_(minimum_match=1),
    limit=3,
)

for o in response.objects:
    print(o.properties)
and
With the and operator, the search returns objects that contain all tokens in the search string.
from weaviate.classes.query import BM25Operator

jeopardy = client.collections.use("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="Australian mammal cute",
    operator=BM25Operator.and_(),  # Each result must include all tokens (e.g. "australian", "mammal", "cute")
    limit=3,
)

for o in response.objects:
    print(o.properties)
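The difference between the two operators can be sketched in plain Python, without a Weaviate instance. This is only an illustration of the matching rule described above: the matches function and its whitespace-and-lowercase tokenization are assumptions made for the example, not part of the Weaviate client or server.

```python
def matches(query, text, operator="or", minimum_match=1):
    """Return True if `text` satisfies the operator for the query tokens."""
    # Assumption: simple lowercase + whitespace tokenization for illustration.
    query_tokens = set(query.lower().split())
    text_tokens = set(text.lower().split())
    hits = len(query_tokens & text_tokens)
    if operator == "and":
        # "and": every query token must be present
        return hits == len(query_tokens)
    # "or": at least `minimum_match` query tokens must be present
    return hits >= minimum_match


query = "Australian mammal cute"
print(matches(query, "the koala is a cute australian mammal", "and"))  # True
print(matches(query, "a cute dog", "and"))                             # False
print(matches(query, "a cute dog", "or", minimum_match=1))             # True
print(matches(query, "a cute dog", "or", minimum_match=2))             # False
```

Raising minimum_match moves the or operator closer to the strictness of and, which can help when a long query would otherwise match on a single common token.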
Retrieve BM25F scores
You can retrieve the BM25F score values for each returned object.
from weaviate.classes.query import MetadataQuery

jeopardy = client.collections.use("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="food",
    return_metadata=MetadataQuery(score=True),
    limit=3
)

for o in response.objects:
    print(o.properties)
    print(o.metadata.score)
Example response
The response is like this:
{
"data": {
"Get": {
"JeopardyQuestion": [
{
"_additional": {
"score": "3.0140665"
},
"answer": "food stores (supermarkets)",
"question": "This type of retail store sells more shampoo & makeup than any other"
},
{
"_additional": {
"score": "2.8725255"
},
"answer": "cake",
"question": "Devil's food & angel food are types of this dessert"
},
{
"_additional": {
"score": "2.7672548"
},
"answer": "a closer grocer",
"question": "A nearer food merchant"
}
]
}
}
}
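To build intuition for what these score values represent, here is a minimal sketch of the classic single-field BM25 formula in plain Python. It is not Weaviate's BM25F implementation: the bm25_scores function, the whitespace tokenization, the IDF variant, and the parameter values k1=1.2 and b=0.75 are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document in `docs` against `query` with plain BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by document length
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    "devil's food and angel food are types of this dessert",
    "a nearer food merchant",
    "the capital of australia",
]
scores = bm25_scores("food", docs)
for doc, score in zip(docs, scores):
    print(f"{score:.4f}  {doc}")
```

Documents that do not contain any query term score zero. For the others, the score grows with term frequency and shrinks with document length.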
Search on selected properties only
A keyword search can be directed to only search a subset of object properties. In this example, the BM25 search only uses the question property to produce the BM25F score.
from weaviate.classes.query import MetadataQuery

jeopardy = client.collections.use("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="safety",
    query_properties=["question"],
    return_metadata=MetadataQuery(score=True),
    limit=3
)

for o in response.objects:
    print(o.properties)
    print(o.metadata.score)
Example response
The response is like this:
{
"data": {
"Get": {
"JeopardyQuestion": [
{
"_additional": {
"score": "3.7079012"
},
"answer": "cake",
"question": "Devil's food & angel food are types of this dessert"
},
{
"_additional": {
"score": "3.4311616"
},
"answer": "a closer grocer",
"question": "A nearer food merchant"
},
{
"_additional": {
"score": "2.8312314"
},
"answer": "honey",
"question": "The primary source of this food is the Apis mellifera"
}
]
}
}
}
Use weights to boost properties
You can weight how much each property affects the overall BM25F score. This example boosts the question property by a factor of 2 while the answer property remains static.
jeopardy = client.collections.use("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="food",
    query_properties=["question^2", "answer"],
    limit=3
)

for o in response.objects:
    print(o.properties)
Example response
The response is like this:
{
"data": {
"Get": {
"JeopardyQuestion": [
{
"_additional": {
"score": "4.0038033"
},
"answer": "cake",
"question": "Devil's food & angel food are types of this dessert"
},
{
"_additional": {
"score": "3.8706005"
},
"answer": "a closer grocer",
"question": "A nearer food merchant"
},
{
"_additional": {
"score": "3.2457707"
},
"answer": "food stores (supermarkets)",
"question": "This type of retail store sells more shampoo & makeup than any other"
}
]
}
}
}
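The "property^weight" syntax can be read as a per-property multiplier. The sketch below parses that syntax and applies the weights to per-property term counts. Both helper functions are hypothetical, and the weighted count is only a rough stand-in for how a BM25F-style scorer combines properties. Weaviate's exact formula is not shown here.

```python
def parse_query_properties(query_properties):
    """Split entries such as "question^2" into a {name: weight} mapping."""
    weights = {}
    for entry in query_properties:
        name, _, boost = entry.partition("^")
        # Properties without an explicit boost keep a weight of 1
        weights[name] = float(boost) if boost else 1.0
    return weights


def weighted_term_frequency(term, obj, weights):
    """Sum the per-property counts of `term`, scaled by each property's weight."""
    total = 0.0
    for prop, weight in weights.items():
        tokens = obj.get(prop, "").lower().split()
        total += weight * tokens.count(term)
    return total


weights = parse_query_properties(["question^2", "answer"])
print(weights)  # {'question': 2.0, 'answer': 1.0}

in_question = {"question": "A nearer food merchant", "answer": "a closer grocer"}
in_answer = {"question": "This type of retail store", "answer": "food stores"}
print(weighted_term_frequency("food", in_question, weights))  # 2.0
print(weighted_term_frequency("food", in_answer, weights))    # 1.0
```

With the boost in place, a match in the question property counts twice as much as the same match in the answer property.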
Set tokenization
The BM25 query string is tokenized before it is used to search for objects using the inverted index.
You must specify the tokenization method in the collection definition for each property.
from weaviate.classes.config import Configure, Property, DataType, Tokenization

client.collections.create(
    "Article",
    vector_config=Configure.Vectors.text2vec_cohere(),
    properties=[
        Property(
            name="title",
            data_type=DataType.TEXT,
            vectorize_property_name=True,  # Use "title" as part of the value to vectorize
            tokenization=Tokenization.LOWERCASE,  # Use "lowercase" tokenization
            description="The title of the article.",  # Optional description
        ),
        Property(
            name="body",
            data_type=DataType.TEXT,
            skip_vectorization=True,  # Don't vectorize this property
            tokenization=Tokenization.WHITESPACE,  # Use "whitespace" tokenization
        ),
    ],
)
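The choice of tokenization decides which query strings can match a property. The following plain-Python sketch approximates three of the methods. It assumes that whitespace splits on whitespace and preserves case, that lowercase does the same after lowercasing, and that field keeps the trimmed value as a single token. The tokenize helper is hypothetical and is not the Weaviate tokenizer.

```python
def tokenize(text, method):
    """Approximate a few tokenization methods (illustrative assumptions only)."""
    if method == "whitespace":
        return text.split()          # case-sensitive tokens
    if method == "lowercase":
        return text.lower().split()  # case-insensitive tokens
    if method == "field":
        return [text.strip()]        # the whole value as one token
    raise ValueError(f"Unsupported method in this sketch: {method}")


title = "Hello World"
print(tokenize(title, "whitespace"))  # ['Hello', 'World']
print(tokenize(title, "lowercase"))   # ['hello', 'world']
print(tokenize(title, "field"))       # ['Hello World']
```

Under these assumptions, a query for "hello" would match the lowercase tokens but not the whitespace tokens, because the latter keep the capital "H".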
For fuzzy matching and typo tolerance, use trigram tokenization. See the Fuzzy matching section below for details.
Accent folding
Added in v1.37. This is a preview feature. The API may change in future releases.
Text properties can enable accent folding via textAnalyzer.asciiFold to normalize accented characters to their ASCII equivalents during both indexing and querying. For example, "Café Crème" becomes searchable as "cafe creme" and vice versa. This improves BM25 recall for multilingual content without requiring users to type exact accented characters.
See Inverted index: Accent folding for configuration details.
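As a rough illustration of what accent folding does, the sketch below strips diacritics with Python's standard unicodedata module. It is an approximation for intuition only: the ascii_fold helper is hypothetical, and Weaviate's own folding rules may cover characters that this approach leaves unchanged.

```python
import unicodedata


def ascii_fold(text):
    """Drop combining accent marks after Unicode decomposition."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(ascii_fold("Café Crème"))          # Cafe Creme
print(ascii_fold("Café Crème").lower())  # cafe creme
```

Applying the same folding to both the indexed text and the query is what lets "cafe creme" and "Café Crème" match each other.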
Stopwords
Added in v1.37. This is a preview feature. The API may change in future releases.
By default, Weaviate filters out common English stopwords (like "a", "the", "is") from BM25 scoring. You can customize this behavior:
- Custom presets: Define named stopword lists per collection via invertedIndexConfig.stopwordPresets. This is useful for non-English languages or domain-specific terms.
- Per-property overrides: Assign different stopword presets to individual properties via textAnalyzer.stopwordPreset. This is useful for multilingual collections where each property contains text in a different language.
Stopwords are still indexed and only filtered at query time, so changing the configuration does not require reindexing.
See Inverted index: Custom stopword presets and the stopwords configuration reference for details.
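Query-time stopword filtering can be pictured as removing tokens from the query before scoring. The sketch below is a plain-Python illustration: the strip_stopwords helper is hypothetical, the default set contains only the three examples mentioned above rather than Weaviate's full English list, and the French set is an invented example of a custom preset.

```python
# Only the three examples from the text, not a complete stopword list
ENGLISH_EXAMPLE = {"a", "the", "is"}


def strip_stopwords(query, stopwords=ENGLISH_EXAMPLE):
    """Remove stopwords from the query tokens before scoring."""
    return [token for token in query.lower().split() if token not in stopwords]


print(strip_stopwords("the food is a treat"))  # ['food', 'treat']

# A custom preset for another language (illustrative values)
french_example = {"le", "est"}
print(strip_stopwords("le chat est noir", french_example))  # ['chat', 'noir']
```

Because the filtering happens on the query side in this picture, swapping the stopword set changes results immediately, without touching the indexed data.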
limit & offset
Use limit to set a fixed maximum number of objects to return.
Optionally, use offset to paginate the results.
jeopardy = client.collections.use("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="safety",
    limit=3,
    offset=1
)

for o in response.objects:
    print(o.properties)
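Conceptually, limit and offset select a window from the ranked result list. The sketch below shows that window with a list slice. The paginate helper and the placeholder object names are hypothetical.

```python
def paginate(ranked, limit, offset=0):
    """Skip `offset` results, then return at most `limit` results."""
    return ranked[offset:offset + limit]


ranked = ["obj0", "obj1", "obj2", "obj3", "obj4"]  # best match first
print(paginate(ranked, limit=3, offset=1))  # ['obj1', 'obj2', 'obj3']
print(paginate(ranked, limit=3, offset=3))  # ['obj3', 'obj4']
```

To walk through pages of a fixed size, increase offset by limit on each request.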
Limit result groups
To limit results to groups of objects with similar scores, use the autocut filter (auto_limit in the Python client) to set the number of groups to return.
jeopardy = client.collections.use("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="safety",
    auto_limit=1
)

for o in response.objects:
    print(o.properties)
Example response
The response is like this:
{
"data": {
"Get": {
"JeopardyQuestion": [
{
"_additional": {
"score": "2.6768136"
},
"answer": "OSHA (Occupational Safety and Health Administration)",
"question": "The government admin. was created in 1971 to ensure occupational health & safety standards"
}
]
}
}
}
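The idea of cutting results at a jump in scores can be sketched as follows. This is a simplified heuristic written for illustration, not Weaviate's autocut algorithm: the autocut function and its rule of treating any drop larger than the average drop as a jump are assumptions.

```python
def autocut(scores, auto_limit):
    """Keep results up to the `auto_limit`-th large drop in score.

    `scores` must be sorted from best to worst.
    """
    if len(scores) < 2:
        return list(scores)
    # Drop in score between each pair of neighboring results
    gaps = [a - b for a, b in zip(scores, scores[1:])]
    mean_gap = sum(gaps) / len(gaps)
    jumps = 0
    for i, gap in enumerate(gaps):
        if gap > mean_gap:  # hypothetical definition of a "jump"
            jumps += 1
            if jumps == auto_limit:
                return list(scores[: i + 1])
    return list(scores)


scores = [2.7, 1.2, 1.1, 1.0, 0.2]
print(autocut(scores, auto_limit=1))  # [2.7]
print(autocut(scores, auto_limit=2))  # [2.7, 1.2, 1.1, 1.0]
```

With auto_limit=1 only the first cluster of scores is kept, which mirrors the single-object response shown above.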
Group results
Define criteria to group search results.
from weaviate.classes.query import GroupBy

jeopardy = client.collections.use("JeopardyQuestion")

# Grouping parameters
group_by = GroupBy(
    prop="round",  # group by this property
    objects_per_group=3,  # maximum objects per group
    number_of_groups=2,  # maximum number of groups
)

# Query
response = jeopardy.query.bm25(
    query="California",
    group_by=group_by
)

for grp_name, grp_content in response.groups.items():
    print(grp_name, grp_content.objects)
Example response
The response is like this:
'Jeopardy!'
'Double Jeopardy!'
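The effect of the two grouping limits can be sketched in plain Python. The group_results helper and the sample objects are hypothetical, and the sketch assumes that groups are created in the order in which their first member appears in the ranked results.

```python
def group_results(objects, prop, objects_per_group, number_of_groups):
    """Group ranked objects by `prop`, honoring both limits."""
    groups = {}
    for obj in objects:  # objects are assumed to be ranked, best first
        key = obj[prop]
        if key not in groups:
            if len(groups) == number_of_groups:
                continue  # no room for another group
            groups[key] = []
        if len(groups[key]) < objects_per_group:
            groups[key].append(obj)
    return groups


ranked = [
    {"question": "q1", "round": "Jeopardy!"},
    {"question": "q2", "round": "Double Jeopardy!"},
    {"question": "q3", "round": "Jeopardy!"},
    {"question": "q4", "round": "Final Jeopardy!"},
    {"question": "q5", "round": "Jeopardy!"},
]
groups = group_results(ranked, prop="round", objects_per_group=2, number_of_groups=2)
for name, members in groups.items():
    print(name, [m["question"] for m in members])
```

In this example "Final Jeopardy!" is dropped because two groups already exist, and "q5" is dropped because its group is full.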
Filter results
For more specific results, use a filter to narrow your search.
from weaviate.classes.query import Filter

jeopardy = client.collections.use("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="food",
    filters=Filter.by_property("round").equal("Double Jeopardy!"),
    return_properties=["answer", "question", "round"],  # return these properties
    limit=3
)

for o in response.objects:
    print(o.properties)
Example response
The response is like this:
{
"data": {
"Get": {
"JeopardyQuestion": [
{
"_additional": {
"score": "3.0140665"
},
"answer": "food stores (supermarkets)",
"question": "This type of retail store sells more shampoo & makeup than any other",
"round": "Double Jeopardy!"
},
{
"_additional": {
"score": "1.9633813"
},
"answer": "honey",
"question": "The primary source of this food is the Apis mellifera",
"round": "Double Jeopardy!"
},
{
"_additional": {
"score": "1.6719631"
},
"answer": "pseudopods",
"question": "Amoebas use temporary extensions called these to move or to surround & engulf food",
"round": "Double Jeopardy!"
}
]
}
}
}
Tokenization
Weaviate converts filter terms into tokens. The default tokenization is word. The word tokenizer keeps only alphanumeric characters, lowercases them, and splits on whitespace, so non-alphanumeric characters such as "_" also separate tokens. For example, it converts the string "Test_domain_weaviate" into "test", "domain", and "weaviate".
For details and additional tokenization methods, see Tokenization.
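The word tokenization described above can be approximated with a regular expression. This sketch handles ASCII letters and digits only, which is a simplifying assumption. The word_tokenize helper is hypothetical and is not Weaviate's tokenizer.

```python
import re


def word_tokenize(text):
    """Lowercase the text and split on runs of non-alphanumeric characters."""
    return [token for token in re.split(r"[^a-z0-9]+", text.lower()) if token]


print(word_tokenize("Test_domain_weaviate"))  # ['test', 'domain', 'weaviate']
print(word_tokenize("Double Jeopardy!"))      # ['double', 'jeopardy']
```

The second call shows why a filter value such as "Double Jeopardy!" is compared as the tokens "double" and "jeopardy" under word tokenization.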
Fuzzy matching
You can enable fuzzy matching and typo tolerance in BM25 searches by using trigram tokenization. This technique breaks text into overlapping 3-character sequences, allowing BM25 to find matches even when there are spelling errors or variations.
This enables matching between similar but not identical strings because they share trigrams:
"Morgn" and "Morgan" share the trigrams "mor" and "org", which are two of the three trigrams in "Morgn".
Set the tokenization method to trigram at the property level when creating your collection:
from weaviate.classes.config import Configure, Property, DataType, Tokenization

client.collections.create(
    "Article",
    vector_config=Configure.Vectors.text2vec_cohere(),
    properties=[
        Property(
            name="title",
            data_type=DataType.TEXT,
            tokenization=Tokenization.TRIGRAM,  # Use "trigram" tokenization
        ),
    ],
)
- Use trigram tokenization selectively, on fields that need fuzzy matching. Filtering behavior changes significantly, because text filtering is then based on trigram-tokenized text instead of whole words.
- Keep exact-match fields on word or field tokenization for precision.
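To see why trigrams tolerate typos, the following plain-Python sketch computes the overlapping 3-character sequences of two strings and intersects them. The trigrams helper is hypothetical and assumes lowercased, single-word input. It is not Weaviate's tokenizer.

```python
def trigrams(text):
    """Return the set of overlapping 3-character sequences in `text`."""
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}


typo = trigrams("Morgn")      # {'mor', 'org', 'rgn'}
correct = trigrams("Morgan")  # {'mor', 'org', 'rga', 'gan'}
print(sorted(typo & correct))  # ['mor', 'org']
```

Because the misspelled query still shares tokens with the indexed value, BM25 can score the object above zero even though the whole words differ.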
