Keyword search

Keyword search, also called BM25 ("Best Match 25") or sparse vector search, returns the objects with the highest BM25F scores.

Prefer natural language queries?

The Query Agent translates plain English questions into optimized Weaviate queries automatically - no manual query construction needed.

Cloud only

Basic BM25 search

To use BM25 keyword search, define a search string.

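# Assumes `client` is an already connected Weaviate client instance.
jeopardy = client.collections.use("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="food",  # the search string
    limit=3        # return at most 3 objects
)

for o in response.objects:
    print(o.properties)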

Example response

The response should look like this (shown as GraphQL-style JSON):

{
  "data": {
    "Get": {
      "JeopardyQuestion": [
        {
          "answer": "food stores (supermarkets)",
          "question": "This type of retail store sells more shampoo & makeup than any other"
        },
        {
          "answer": "cake",
          "question": "Devil's food & angel food are types of this dessert"
        },
        {
          "answer": "a closer grocer",
          "question": "A nearer food merchant"
        }
      ]
    }
  }
}

Search operators

Added in v1.31

Search operators define the minimum number of query tokens that must be present in an object for it to be returned. The options are `and` and `or` (the default).

`or`

With the or operator, the search returns objects that contain at least minimumOrTokensMatch of the tokens in the search string. In the Python client, set this value with minimum_match.

from weaviate.classes.query import BM25Operator

jeopardy = client.collections.use("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="Australian mammal cute",
    operator=BM25Operator.or_(minimum_match=1),
    limit=3,
)

for o in response.objects:
    print(o.properties)

`and`

With the and operator, the search returns objects that contain all tokens in the search string.

from weaviate.classes.query import BM25Operator

jeopardy = client.collections.use("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="Australian mammal cute",
    operator=BM25Operator.and_(),  # Each result must include all tokens (e.g. "australian", "mammal", "cute")
    limit=3,
)

for o in response.objects:
    print(o.properties)
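
The difference between the two operators can be sketched in plain Python. This is only an illustration of the matching rule described above, not Weaviate's implementation; the function name matches and the whitespace tokenization are assumptions made for the example.

```python
def matches(query_tokens, doc_tokens, operator="or", minimum_match=1):
    """Decide whether a document passes the operator check (illustrative only)."""
    doc_set = set(doc_tokens)
    unique_query = set(query_tokens)
    # Count how many distinct query tokens occur in the document.
    present = sum(1 for token in unique_query if token in doc_set)
    if operator == "and":
        return present == len(unique_query)   # every token must be present
    return present >= minimum_match           # "or": at least minimum_match tokens

query = "australian mammal cute".split()
koala = "the koala is a cute australian mammal".split()
cat = "a cute cat".split()

and_koala = matches(query, koala, operator="and")               # True
and_cat = matches(query, cat, operator="and")                   # False
or_cat_1 = matches(query, cat, operator="or", minimum_match=1)  # True
or_cat_2 = matches(query, cat, operator="or", minimum_match=2)  # False
```

Raising minimum_match makes the or operator stricter, and setting it to the number of query tokens makes it behave like and in this sketch.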

Retrieve BM25F scores

You can retrieve the BM25F score values for each returned object.

from weaviate.classes.query import MetadataQuery

jeopardy = client.collections.use("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="food",
    return_metadata=MetadataQuery(score=True),
    limit=3
)

for o in response.objects:
    print(o.properties)
    print(o.metadata.score)

Example response

The response should look like this (shown as GraphQL-style JSON):

{
  "data": {
    "Get": {
      "JeopardyQuestion": [
        {
          "_additional": {
            "score": "3.0140665"
          },
          "answer": "food stores (supermarkets)",
          "question": "This type of retail store sells more shampoo & makeup than any other"
        },
        {
          "_additional": {
            "score": "2.8725255"
          },
          "answer": "cake",
          "question": "Devil's food & angel food are types of this dessert"
        },
        {
          "_additional": {
            "score": "2.7672548"
          },
          "answer": "a closer grocer",
          "question": "A nearer food merchant"
        }
      ]
    }
  }
}
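
As a rough illustration of where such scores come from, the sketch below computes a textbook BM25 score in plain Python. It is not Weaviate's implementation: BM25F additionally combines several properties, and the function name bm25_scores, the whitespace tokenizer, and the k1 and b values used here are assumptions made for the example.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document in `docs` against `query` with textbook BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "devil's food and angel food are types of this dessert",
    "a nearer food merchant",
    "the capital of australia",
]
scores = bm25_scores("food", docs)
```

Documents that do not contain any query term score 0, while the others receive a positive score that depends on term frequency, document length, and how rare the term is in the collection.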

Search on selected properties only

A keyword search can be directed to only search a subset of object properties. In this example, the BM25 search only uses the question property to produce the BM25F score.

from weaviate.classes.query import MetadataQuery

jeopardy = client.collections.use("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="safety",
    query_properties=["question"],
    return_metadata=MetadataQuery(score=True),
    limit=3
)

for o in response.objects:
    print(o.properties)
    print(o.metadata.score)

Example response

The response should look like this (shown as GraphQL-style JSON):

{
  "data": {
    "Get": {
      "JeopardyQuestion": [
        {
          "_additional": {
            "score": "3.7079012"
          },
          "answer": "cake",
          "question": "Devil's food & angel food are types of this dessert"
        },
        {
          "_additional": {
            "score": "3.4311616"
          },
          "answer": "a closer grocer",
          "question": "A nearer food merchant"
        },
        {
          "_additional": {
            "score": "2.8312314"
          },
          "answer": "honey",
          "question": "The primary source of this food is the Apis mellifera"
        }
      ]
    }
  }
}

Use weights to boost properties

You can weight how much each property contributes to the overall BM25F score. This example boosts the question property by a factor of 2, while the answer property keeps its default weight.

jeopardy = client.collections.use("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="food",
    query_properties=["question^2", "answer"],
    limit=3
)

for o in response.objects:
    print(o.properties)

Example response

The response should look like this (shown as GraphQL-style JSON):

{
  "data": {
    "Get": {
      "JeopardyQuestion": [
        {
          "_additional": {
            "score": "4.0038033"
          },
          "answer": "cake",
          "question": "Devil's food & angel food are types of this dessert"
        },
        {
          "_additional": {
            "score": "3.8706005"
          },
          "answer": "a closer grocer",
          "question": "A nearer food merchant"
        },
        {
          "_additional": {
            "score": "3.2457707"
          },
          "answer": "food stores (supermarkets)",
          "question": "This type of retail store sells more shampoo & makeup than any other"
        }
      ]
    }
  }
}
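
To make the "question^2" boost syntax concrete, here is a small plain-Python sketch that parses the property specs and applies the weights to raw term counts. It only conveys the idea of weighting properties; the real BM25F formula is more involved, and the helper names parse_boost and weighted_term_frequency are hypothetical.

```python
def parse_boost(spec):
    """Split a spec such as "question^2" into (property name, weight)."""
    name, sep, weight = spec.partition("^")
    return name, (float(weight) if sep else 1.0)

def weighted_term_frequency(term, obj, query_properties):
    """Sum the term's occurrences per property, scaled by each property's weight."""
    total = 0.0
    for spec in query_properties:
        name, weight = parse_boost(spec)
        tokens = str(obj.get(name, "")).lower().split()
        total += weight * tokens.count(term)
    return total

props = ["question^2", "answer"]
cake = {"answer": "cake",
        "question": "Devil's food & angel food are types of this dessert"}
stores = {"answer": "food stores (supermarkets)",
          "question": "This type of retail store sells more shampoo & makeup than any other"}

cake_tf = weighted_term_frequency("food", cake, props)      # 2 hits in question, weight 2
stores_tf = weighted_term_frequency("food", stores, props)  # 1 hit in answer, weight 1
```

In this sketch a match in the boosted question property counts twice as much as a match in answer, which is why objects matching in question move up the ranking.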

Set tokenization

The BM25 query string is tokenized before it is used to search for objects using the inverted index.

You must specify the tokenization method in the collection definition for each property.

from weaviate.classes.config import Configure, Property, DataType, Tokenization

client.collections.create(
    "Article",
    vector_config=Configure.Vectors.text2vec_cohere(),
    properties=[
        Property(
            name="title",
            data_type=DataType.TEXT,
            vectorize_property_name=True,  # Use "title" as part of the value to vectorize
            tokenization=Tokenization.LOWERCASE,  # Use "lowercase" tokenization
            description="The title of the article.",  # Optional description
        ),
        Property(
            name="body",
            data_type=DataType.TEXT,
            skip_vectorization=True,  # Don't vectorize this property
            tokenization=Tokenization.WHITESPACE,  # Use "whitespace" tokenization
        ),
    ],
)

Tokenization and fuzzy matching

For fuzzy matching and typo tolerance, use trigram tokenization. See the Fuzzy matching section below for details.

`limit` & `offset`

Use limit to set a fixed maximum number of objects to return.

Optionally, use offset to paginate the results.

jeopardy = client.collections.use("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="safety",
    limit=3,
    offset=1
)

for o in response.objects:
    print(o.properties)
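
The interplay of limit and offset can be pictured as slicing a ranked result list. The sketch below uses plain Python and placeholder strings rather than a live collection; the function name paginate is an assumption made for the example.

```python
def paginate(ranked, limit, offset=0):
    """Return up to `limit` results, skipping the first `offset` results."""
    return ranked[offset:offset + limit]

ranked = ["obj0", "obj1", "obj2", "obj3", "obj4"]  # hypothetical ranked results
first_page = paginate(ranked, limit=3)             # ["obj0", "obj1", "obj2"]
shifted = paginate(ranked, limit=3, offset=1)      # ["obj1", "obj2", "obj3"]
last_page = paginate(ranked, limit=3, offset=3)    # ["obj3", "obj4"]
```

To walk through pages of a fixed size, increase offset by limit on each request. The final page may contain fewer than limit objects.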

Limit result groups

To limit results to groups of objects with similar scores, use the autocut filter (auto_limit in the Python client) to set the number of groups to return. Autocut stops returning results after the Nth jump in score.

jeopardy = client.collections.use("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="safety",
    auto_limit=1
)

for o in response.objects:
    print(o.properties)

Example response

The response should look like this (shown as GraphQL-style JSON):

{
  "data": {
    "Get": {
      "JeopardyQuestion": [
        {
          "_additional": {
            "score": "2.6768136"
          },
          "answer": "OSHA (Occupational Safety and Health Administration)",
          "question": "The government admin. was created in 1971 to ensure occupational health & safety standards"
        }
      ]
    }
  }
}
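
The idea of cutting a result list at jumps in the score can be sketched as follows. This is a deliberately simplified heuristic, not Weaviate's autocut algorithm: here a "jump" is assumed to be any drop between consecutive scores that is larger than the average drop, and the function name autocut and the sample scores are made up for the example.

```python
def autocut(scores, max_groups):
    """Return how many leading results to keep (scores sorted high to low)."""
    if len(scores) < 2:
        return len(scores)
    # Drop in score between each result and the next one.
    gaps = [scores[i] - scores[i + 1] for i in range(len(scores) - 1)]
    mean_gap = sum(gaps) / len(gaps)
    jumps = 0
    for i, gap in enumerate(gaps):
        if gap > mean_gap:            # assumed definition of a "jump"
            jumps += 1
            if jumps == max_groups:
                return i + 1          # keep everything before this jump
    return len(scores)                # fewer jumps than requested: keep all

scores = [2.68, 1.20, 1.15, 1.10, 0.40, 0.38]  # hypothetical BM25 scores
keep_one_group = autocut(scores, max_groups=1)   # 1
keep_two_groups = autocut(scores, max_groups=2)  # 4
```

With one group, only the clearly best result survives. With two groups, the cluster of three similar scores behind it is returned as well.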

Group results

Added in v1.25

Define criteria to group search results.

from weaviate.classes.query import GroupBy

jeopardy = client.collections.use("JeopardyQuestion")

# Grouping parameters
group_by = GroupBy(
    prop="round",  # group by this property
    objects_per_group=3,  # maximum objects per group
    number_of_groups=2,  # maximum number of groups
)

# Query
response = jeopardy.query.bm25(
    query="California",
    group_by=group_by
)

for grp_name, grp_content in response.groups.items():
    print(grp_name, grp_content.objects)

Example response

The response should contain groups with names like these:

'Jeopardy!'
'Double Jeopardy!'
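
How objects_per_group and number_of_groups interact can be sketched in plain Python. This is an assumption-laden illustration, not the server-side logic: it walks a ranked list in order, opens groups until the group limit is reached, and fills each group up to its object limit. The function name group_results and the sample objects are hypothetical.

```python
def group_results(ranked, prop, objects_per_group, number_of_groups):
    """Group ranked objects by `prop`, capping group count and group size."""
    groups = {}
    for obj in ranked:
        key = obj[prop]
        if key not in groups:
            if len(groups) == number_of_groups:
                continue              # group limit reached: skip new groups
            groups[key] = []
        if len(groups[key]) < objects_per_group:
            groups[key].append(obj)   # keep the best-ranked objects per group
    return groups

ranked = [  # hypothetical results, best match first
    {"answer": "A", "round": "Jeopardy!"},
    {"answer": "B", "round": "Double Jeopardy!"},
    {"answer": "C", "round": "Jeopardy!"},
    {"answer": "D", "round": "Final Jeopardy!"},
    {"answer": "E", "round": "Jeopardy!"},
]
groups = group_results(ranked, "round", objects_per_group=2, number_of_groups=2)
group_names = list(groups)  # ["Jeopardy!", "Double Jeopardy!"]
```

In this example, object D is dropped because it would open a third group, and object E is dropped because its group is already full.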

Filter results

For more specific results, use a filter to narrow your search.

from weaviate.classes.query import Filter

jeopardy = client.collections.use("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="food",
    filters=Filter.by_property("round").equal("Double Jeopardy!"),
    return_properties=["answer", "question", "round"], # return these properties
    limit=3
)

for o in response.objects:
    print(o.properties)

Example response

The response should look like this (shown as GraphQL-style JSON):

{
  "data": {
    "Get": {
      "JeopardyQuestion": [
        {
          "_additional": {
            "score": "3.0140665"
          },
          "answer": "food stores (supermarkets)",
          "question": "This type of retail store sells more shampoo & makeup than any other",
          "round": "Double Jeopardy!"
        },
        {
          "_additional": {
            "score": "1.9633813"
          },
          "answer": "honey",
          "question": "The primary source of this food is the Apis mellifera",
          "round": "Double Jeopardy!"
        },
        {
          "_additional": {
            "score": "1.6719631"
          },
          "answer": "pseudopods",
          "question": "Amoebas use temporary extensions called these to move or to surround & engulf food",
          "round": "Double Jeopardy!"
        }
      ]
    }
  }
}

Tokenization

Weaviate converts filter terms into tokens. The default tokenization is word. The word tokenizer keeps only alphanumeric characters, lowercases them, and splits on whitespace. It converts a string like "Test_domain_weaviate" into "test", "domain", and "weaviate".

For details and additional tokenization methods, see Tokenization.
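
The effect of the different tokenization methods can be approximated in a few lines of plain Python. The sketch below is an approximation for illustration, not Weaviate's tokenizer: it handles only ASCII letters and digits, and the function name tokenize is an assumption.

```python
import re

def tokenize(text, method="word"):
    """Approximate a few tokenization methods (ASCII only, illustrative)."""
    if method == "word":
        # Lowercase, then treat every non-alphanumeric character as a separator.
        return re.findall(r"[a-z0-9]+", text.lower())
    if method == "lowercase":
        return text.lower().split()   # lowercase, split on whitespace only
    if method == "whitespace":
        return text.split()           # split on whitespace, keep case
    if method == "field":
        return [text.strip()]         # the whole trimmed value is one token
    raise ValueError(f"unknown tokenization method: {method}")

word_tokens = tokenize("Test_domain_weaviate")                    # ["test", "domain", "weaviate"]
lowercase_tokens = tokenize("Test_domain_weaviate", "lowercase")  # ["test_domain_weaviate"]
```

Because the same tokenization is applied to the stored text and to the search or filter term, a term like "domain" matches "Test_domain_weaviate" under word tokenization but not under lowercase tokenization in this sketch.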

Fuzzy matching

You can enable fuzzy matching and typo tolerance in BM25 searches by using trigram tokenization. This technique breaks text into overlapping 3-character sequences, allowing BM25 to find matches even when there are spelling errors or variations.

This enables matching between similar but not identical strings because they share trigrams:

"Morgn" and "Morgan" share the trigrams "mor" and "org"

Set the tokenization method to trigram at the property level when creating your collection:

from weaviate.classes.config import Configure, Property, DataType, Tokenization

client.collections.create(
    "Article",
    vector_config=Configure.Vectors.text2vec_cohere(),
    properties=[
        Property(
            name="title",
            data_type=DataType.TEXT,
            tokenization=Tokenization.TRIGRAM,  # Use "trigram" tokenization
        ),
    ],
)
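
To see why trigram tokenization tolerates typos, the following plain-Python sketch splits two strings into overlapping 3-character sequences and compares them. It is an illustration only; the function name trigrams and the lowercasing step are assumptions, and Weaviate's tokenizer may treat details such as punctuation differently.

```python
def trigrams(text):
    """Return the set of overlapping 3-character sequences in `text`."""
    s = text.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}

typo = trigrams("Morgn")      # {"mor", "org", "rgn"}
correct = trigrams("Morgan")  # {"mor", "org", "rga", "gan"}
shared = typo & correct       # {"mor", "org"}
```

The misspelled query still shares two of its three trigrams with the correct spelling, so a keyword search over trigram tokens can match it, whereas a whole-word comparison would not.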

Best practices

Use trigram tokenization selectively, only on fields that need fuzzy matching. It changes filtering behavior significantly, because text filters then operate on trigram-tokenized text instead of whole words.
Keep exact-match fields on word or field tokenization for precision.

Questions and feedback

If you have any questions or feedback, let us know in the user forum.

Technical questions

If you have questions feel free to post on our Community forum.

Documentation feedback

Leave feedback by opening a GitHub issue.
