Keyword search
Keyword search, also called "BM25 (Best match 25)" or "sparse vector" search, returns objects that have the highest BM25F scores.
Basic BM25 search
To use BM25 keyword search, define a search string.
- Python
- JS/TS
- Go
- Java
- GraphQL
jeopardy = client.collections.use("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="food",
    limit=3
)

for o in response.objects:
    print(o.properties)
const jeopardy = client.collections.use('JeopardyQuestion');
const result = await jeopardy.query.bm25('food', {
limit: 3,
})
for (let object of result.objects) {
console.log(JSON.stringify(object.properties, null, 2));
}
ctx := context.Background()
className := "JeopardyQuestion"
query := (&graphql.BM25ArgumentBuilder{}).WithQuery("food")
limit := int(3)
result, err := client.GraphQL().Get().
WithClassName(className).
WithFields(
graphql.Field{Name: "question"},
graphql.Field{Name: "answer"},
).
WithBM25(query).
WithLimit(limit).
Do(ctx)
Bm25Argument keyword = Bm25Argument.builder()
.query("food")
.build();
Fields fields = Fields.builder()
.fields(new Field[]{
Field.builder().name("question").build(),
Field.builder().name("answer").build(),
})
.build();
String query = GetBuilder.builder()
.className("JeopardyQuestion")
.fields(fields)
.withBm25Filter(keyword)
.limit(3)
.build()
.buildQuery();
Result<GraphQLResponse> result = client.graphQL().raw().withQuery(query).run();
{
Get {
JeopardyQuestion(
limit: 3
bm25: {
query: "food"
}
) {
question
answer
}
}
}
Example response
The response is like this:
{
"data": {
"Get": {
"JeopardyQuestion": [
{
"answer": "food stores (supermarkets)",
"question": "This type of retail store sells more shampoo & makeup than any other"
},
{
"answer": "cake",
"question": "Devil's food & angel food are types of this dessert"
},
{
"answer": "a closer grocer",
"question": "A nearer food merchant"
}
]
}
}
}
Search operators
v1.31
Search operators define the minimum number of query tokens that must be present in the object to be returned. The options are "and" and "or" (the default).
or
With the or operator, the search returns objects that contain at least minimumOrTokensMatch of the tokens in the search string.
- Python
- GraphQL
from weaviate.classes.query import BM25Operator

jeopardy = client.collections.use("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="Australian mammal cute",
    operator=BM25Operator.or_(minimum_match=1),
    limit=3,
)

for o in response.objects:
    print(o.properties)
{
Get {
JeopardyQuestion(
limit: 3
bm25: {
query: "Australian mammal cute"
searchOperator: {
operator: Or,
minimumOrTokensMatch: 2
}
}
) {
question
answer
}
}
}
and
With the and operator, the search returns objects that contain all tokens in the search string.
- Python
- GraphQL
from weaviate.classes.query import BM25Operator

jeopardy = client.collections.use("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="Australian mammal cute",
    operator=BM25Operator.and_(),  # Each result must include all tokens (e.g. "australian", "mammal", "cute")
    limit=3,
)

for o in response.objects:
    print(o.properties)
{
Get {
JeopardyQuestion(
limit: 3
bm25: {
query: "Australian mammal cute"
searchOperator: {
operator: And,
}
}
) {
question
answer
}
}
}
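The difference between the two operators can be illustrated without a running Weaviate instance. The following is a minimal sketch, assuming the query and the object have already been tokenized; the function name `matches` and its parameters are hypothetical and are not part of any Weaviate client.

```python
def matches(query_tokens, object_tokens, operator="or", minimum_match=1):
    """Decide whether an object satisfies the search operator (illustrative only)."""
    wanted = set(query_tokens)
    present = wanted & set(object_tokens)
    if operator == "and":
        # Every query token must occur in the object.
        return present == wanted
    # "or": at least `minimum_match` distinct query tokens must occur.
    return len(present) >= minimum_match


query = ["australian", "mammal", "cute"]
koala = ["the", "koala", "is", "a", "cute", "australian", "mammal"]
emu = ["the", "emu", "is", "an", "australian", "bird"]

print(matches(query, koala, operator="and"))                 # True
print(matches(query, emu, operator="and"))                   # False
print(matches(query, emu, operator="or", minimum_match=1))   # True
print(matches(query, emu, operator="or", minimum_match=2))   # False
```

The operator only decides which objects are eligible; the eligible objects are still ranked by their BM25F scores.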
Retrieve BM25F scores
You can retrieve the BM25F score values for each returned object.
- Python
- JS/TS
- Go
- Java
- GraphQL
from weaviate.classes.query import MetadataQuery

jeopardy = client.collections.use("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="food",
    return_metadata=MetadataQuery(score=True),
    limit=3
)

for o in response.objects:
    print(o.properties)
    print(o.metadata.score)
const jeopardy = client.collections.use('JeopardyQuestion');
const result = await jeopardy.query.bm25('food', {
returnMetadata: ['score'],
limit: 3
})
for (let object of result.objects) {
console.log(JSON.stringify(object.properties, null, 2));
console.log(object.metadata?.score);
}
ctx := context.Background()
className := "JeopardyQuestion"
query := (&graphql.BM25ArgumentBuilder{}).WithQuery("food")
limit := int(3)
result, err := client.GraphQL().Get().
WithClassName(className).
WithFields(
graphql.Field{Name: "question"},
graphql.Field{Name: "answer"},
graphql.Field{
Name: "_additional",
Fields: []graphql.Field{
{Name: "score"},
},
},
).
WithBM25(query).
WithLimit(limit).
Do(ctx)
Bm25Argument keyword = Bm25Argument.builder()
.query("food")
.build();
Fields fields = Fields.builder()
.fields(new Field[]{
Field.builder().name("question").build(),
Field.builder().name("answer").build(),
Field.builder().name("_additional").fields(new Field[]{
Field.builder().name("score").build()
}).build()
})
.build();
String query = GetBuilder.builder()
.className("JeopardyQuestion")
.fields(fields)
.withBm25Filter(keyword)
.limit(3)
.build()
.buildQuery();
Result<GraphQLResponse> result = client.graphQL().raw().withQuery(query).run();
{
Get {
JeopardyQuestion(
limit: 3
bm25: {
query: "food"
}
) {
question
answer
_additional {
score
}
}
}
}
Example response
The response is like this:
{
"data": {
"Get": {
"JeopardyQuestion": [
{
"_additional": {
"score": "3.0140665"
},
"answer": "food stores (supermarkets)",
"question": "This type of retail store sells more shampoo & makeup than any other"
},
{
"_additional": {
"score": "2.8725255"
},
"answer": "cake",
"question": "Devil's food & angel food are types of this dessert"
},
{
"_additional": {
"score": "2.7672548"
},
"answer": "a closer grocer",
"question": "A nearer food merchant"
}
]
}
}
}
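To give a feel for how such scores arise, here is a minimal sketch of one common textbook variant of the BM25 formula, applied to a single property. It is not Weaviate's implementation: the function name `bm25_score`, the toy corpus, and the parameter values `k1=1.2` and `b=0.75` are assumptions chosen for illustration, so the numbers it produces will not match the scores shown above.

```python
import math


def bm25_score(query_tokens, doc_tokens, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against a query with a textbook BM25 variant."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for term in set(query_tokens):
        # Document frequency: how many documents contain the term.
        df = sum(1 for d in corpus if term in d)
        if df == 0:
            continue
        # Rare terms get a higher inverse document frequency.
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # Term frequency, saturated by k1 and normalized by document length via b.
        tf = doc_tokens.count(term)
        length_norm = 1 - b + b * len(doc_tokens) / avg_len
        score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
    return score


corpus = [
    ["a", "nearer", "food", "merchant"],
    ["devil's", "food", "angel", "food", "cake"],
    ["the", "capital", "of", "france"],
]

for doc in corpus:
    print(doc, round(bm25_score(["food"], doc, corpus), 4))
```

Documents that do not contain any query term score zero, and a document that repeats the term scores higher, although with diminishing returns.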
Search on selected properties only
A keyword search can be directed to only search a subset of object properties. In this example, the BM25 search only uses the question property to produce the BM25F score.
- Python
- JS/TS
- Go
- Java
- GraphQL
from weaviate.classes.query import MetadataQuery

jeopardy = client.collections.use("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="food",
    query_properties=["question"],
    return_metadata=MetadataQuery(score=True),
    limit=3
)

for o in response.objects:
    print(o.properties)
    print(o.metadata.score)
const jeopardy = client.collections.use('JeopardyQuestion');
const result = await jeopardy.query.bm25('food', {
queryProperties: ['question'],
returnMetadata: ['score'],
limit: 3
})
for (let object of result.objects) {
console.log(JSON.stringify(object.properties, null, 2));
console.log(object.metadata?.score);
}
ctx := context.Background()
className := "JeopardyQuestion"
query := (&graphql.BM25ArgumentBuilder{}).WithQuery("food").WithProperties("question")
limit := int(3)
result, err := client.GraphQL().Get().
WithClassName(className).
WithFields(
graphql.Field{Name: "question"},
graphql.Field{
Name: "_additional",
Fields: []graphql.Field{
{Name: "score"},
},
},
).
WithBM25(query).
WithLimit(limit).
Do(ctx)
Bm25Argument keyword = Bm25Argument.builder()
.query("food")
.properties(new String[]{ "question" })
.build();
Fields fields = Fields.builder()
.fields(new Field[]{
Field.builder().name("question").build(),
Field.builder().name("answer").build(),
Field.builder().name("_additional").fields(new Field[]{
Field.builder().name("score").build()
}).build()
})
.build();
String query = GetBuilder.builder()
.className("JeopardyQuestion")
.fields(fields)
.withBm25Filter(keyword)
.limit(3)
.build()
.buildQuery();
Result<GraphQLResponse> result = client.graphQL().raw().withQuery(query).run();
{
Get {
JeopardyQuestion(
limit: 3
bm25: {
query: "food"
properties: ["question"]
}
) {
question
answer
_additional {
score
}
}
}
}
Example response
The response is like this:
{
"data": {
"Get": {
"JeopardyQuestion": [
{
"_additional": {
"score": "3.7079012"
},
"answer": "cake",
"question": "Devil's food & angel food are types of this dessert"
},
{
"_additional": {
"score": "3.4311616"
},
"answer": "a closer grocer",
"question": "A nearer food merchant"
},
{
"_additional": {
"score": "2.8312314"
},
"answer": "honey",
"question": "The primary source of this food is the Apis mellifera"
}
]
}
}
}
Use weights to boost properties
You can weight how much each property affects the overall BM25F score. This example boosts the question property by a factor of 2, while the answer property keeps its default weight.
- Python
- JS/TS
- Java
- Go
- GraphQL
jeopardy = client.collections.use("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="food",
    query_properties=["question^2", "answer"],
    limit=3
)

for o in response.objects:
    print(o.properties)
const jeopardy = client.collections.use('JeopardyQuestion');
const result = await jeopardy.query.bm25('food', {
queryProperties: ['question^2', 'answer'],
returnMetadata: ['score'],
limit: 3
})
for (let object of result.objects) {
console.log(JSON.stringify(object.properties, null, 2));
console.log(object.metadata?.score);
}
Bm25Argument keyword = Bm25Argument.builder()
.query("food")
.properties(new String[]{ "question^2", "answer" })
.build();
Fields fields = Fields.builder()
.fields(new Field[]{
Field.builder().name("question").build(),
Field.builder().name("answer").build(),
Field.builder().name("_additional").fields(new Field[]{
Field.builder().name("score").build()
}).build()
})
.build();
String query = GetBuilder.builder()
.className("JeopardyQuestion")
.fields(fields)
.withBm25Filter(keyword)
.limit(3)
.build()
.buildQuery();
Result<GraphQLResponse> result = client.graphQL().raw().withQuery(query).run();
ctx := context.Background()
className := "JeopardyQuestion"
query := (&graphql.BM25ArgumentBuilder{}).WithQuery("food").WithProperties("question^2", "answer")
limit := int(3)
result, err := client.GraphQL().Get().
WithClassName(className).
WithFields(
graphql.Field{Name: "question"},
graphql.Field{Name: "answer"},
).
WithBM25(query).
WithLimit(limit).
Do(ctx)
{
Get {
JeopardyQuestion(
limit: 3
bm25: {
query: "food"
properties: ["question^2", "answer"]
}
) {
question
answer
_additional {
score
}
}
}
}
Example response
The response is like this:
{
"data": {
"Get": {
"JeopardyQuestion": [
{
"_additional": {
"score": "4.0038033"
},
"answer": "cake",
"question": "Devil's food & angel food are types of this dessert"
},
{
"_additional": {
"score": "3.8706005"
},
"answer": "a closer grocer",
"question": "A nearer food merchant"
},
{
"_additional": {
"score": "3.2457707"
},
"answer": "food stores (supermarkets)",
"question": "This type of retail store sells more shampoo & makeup than any other"
}
]
}
}
}
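The `property^weight` notation can be read as a property name plus a multiplier. The sketch below shows one way to parse that notation and apply the multiplier to per-property term counts. It is a simplified illustration, not Weaviate's scoring code: `parse_property` and `weighted_term_frequency` are hypothetical helpers, and the sketch uses plain lowercase-and-split tokenization.

```python
def parse_property(spec):
    """Split a spec such as "question^2" into (name, weight); weight defaults to 1.0."""
    name, sep, weight = spec.partition("^")
    return name, (float(weight) if sep else 1.0)


def weighted_term_frequency(term, obj, query_properties):
    """Sum the term's occurrences across properties, scaled by each property's weight."""
    total = 0.0
    for spec in query_properties:
        name, weight = parse_property(spec)
        tokens = obj.get(name, "").lower().split()
        total += weight * tokens.count(term)
    return total


in_question = {"question": "A nearer food merchant", "answer": "a closer grocer"}
in_answer = {"question": "This type of retail store sells more shampoo", "answer": "food stores"}
props = ["question^2", "answer"]

print(weighted_term_frequency("food", in_question, props))  # 2.0
print(weighted_term_frequency("food", in_answer, props))    # 1.0
```

With the boost in place, a match in the question property counts twice as much as the same match in the answer property, which is why the ranking in the example response differs from the unweighted search.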
Set tokenization
The BM25 query string is tokenized before it is used to search for objects using the inverted index.
You must specify the tokenization method in the collection definition for each property.
- Python
- JS/TS
- Java
from weaviate.classes.config import Configure, Property, DataType, Tokenization
client.collections.create(
"Article",
vector_config=Configure.Vectors.text2vec_cohere(),
properties=[
Property(
name="title",
data_type=DataType.TEXT,
vectorize_property_name=True, # Use "title" as part of the value to vectorize
tokenization=Tokenization.LOWERCASE, # Use "lowercase" tokenization
description="The title of the article.", # Optional description
),
Property(
name="body",
data_type=DataType.TEXT,
skip_vectorization=True, # Don't vectorize this property
tokenization=Tokenization.WHITESPACE, # Use "whitespace" tokenization
),
],
)
import { vectors, dataType, tokenization } from 'weaviate-client';
const newCollection = await client.collections.create({
name: 'Article',
vectorizers: vectors.text2VecHuggingFace(),
properties: [
{
name: 'title',
dataType: dataType.TEXT,
vectorizePropertyName: true,
tokenization: tokenization.LOWERCASE // or 'lowercase'
},
{
name: 'body',
dataType: dataType.TEXT,
skipVectorization: true,
tokenization: tokenization.WHITESPACE // or 'whitespace'
},
],
})
Property titleProperty = Property.builder()
.name("title")
.description("title of the article")
.dataType(Arrays.asList(DataType.TEXT))
.tokenization(Tokenization.LOWERCASE)
.build();
Property bodyProperty = Property.builder()
.name("body")
.description("body of the article")
.dataType(Arrays.asList(DataType.TEXT))
.tokenization(Tokenization.WHITESPACE)
.build();
//Add the defined properties to the class
WeaviateClass articleClass = WeaviateClass.builder()
.className("Article")
.description("Article Class Description...")
.properties(Arrays.asList(titleProperty, bodyProperty))
.build();
Result<Boolean> result = client.schema().classCreator()
.withClass(articleClass)
.run();
For fuzzy matching and typo tolerance, use trigram tokenization. See the Fuzzy matching section below for details.
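The effect of the tokenization choice is easiest to see on a concrete string. The following sketch approximates three of the tokenization methods (whitespace, lowercase, and field) in plain Python. The `tokenize` helper is hypothetical and only mirrors the documented behavior of each method; it does not call Weaviate.

```python
def tokenize(text, method):
    """Approximate three tokenization methods for illustration."""
    if method == "whitespace":
        # Split on whitespace, keep case and punctuation.
        return text.split()
    if method == "lowercase":
        # Lowercase first, then split on whitespace.
        return text.lower().split()
    if method == "field":
        # Index the whole (trimmed) value as a single token.
        return [text.strip()]
    raise ValueError(f"unsupported method: {method}")


title = "Hello, (beautiful) world"
for method in ("whitespace", "lowercase", "field"):
    print(method, tokenize(title, method))
```

Because the query string is tokenized with the same method as the property, a query for "hello," would match the lowercase-tokenized property but not the whitespace-tokenized one, where the stored token is "Hello,".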
limit & offset
Use limit to set a fixed maximum number of objects to return.
Optionally, use offset to paginate the results.
- Python
- JS/TS
- Go
- Java
- GraphQL
jeopardy = client.collections.use("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="safety",
    limit=3,
    offset=1
)

for o in response.objects:
    print(o.properties)
const jeopardy = client.collections.use('JeopardyQuestion');
const result = await jeopardy.query.bm25('safety', {
limit: 3,
offset: 1
})
for (let object of result.objects) {
console.log(JSON.stringify(object.properties, null, 2));
}
ctx := context.Background()
className := "JeopardyQuestion"
query := (&graphql.BM25ArgumentBuilder{}).WithQuery("safety")
limit := int(3)
offset := int(1)
result, err := client.GraphQL().Get().
WithClassName(className).
WithFields(
graphql.Field{Name: "question"},
graphql.Field{Name: "answer"},
).
WithBM25(query).
WithLimit(limit).
WithOffset(offset).
Do(ctx)
Bm25Argument keyword = Bm25Argument.builder()
.query("safety")
.build();
Fields fields = Fields.builder()
.fields(new Field[]{
Field.builder().name("question").build(),
Field.builder().name("answer").build(),
})
.build();
String query = GetBuilder.builder()
.className("JeopardyQuestion")
.fields(fields)
.withBm25Filter(keyword)
.limit(3)
.offset(1)
.build()
.buildQuery();
Result<GraphQLResponse> result = client.graphQL().raw().withQuery(query).run();
{
Get {
JeopardyQuestion(
bm25: {
query: "safety"
}
limit: 3
offset: 1
) {
question
answer
_additional {
score
}
}
}
}
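Conceptually, limit and offset select a window from the ranked result list. The sketch below shows that window on a plain Python list; `paginate` is a hypothetical helper used only for illustration.

```python
def paginate(ranked, limit, offset=0):
    """Skip the first `offset` results, then return at most `limit` results."""
    return ranked[offset:offset + limit]


ranked = ["result-1", "result-2", "result-3", "result-4", "result-5"]

print(paginate(ranked, limit=3))            # ['result-1', 'result-2', 'result-3']
print(paginate(ranked, limit=3, offset=1))  # ['result-2', 'result-3', 'result-4']
print(paginate(ranked, limit=3, offset=3))  # ['result-4', 'result-5']
```

To page through results, keep limit fixed and increase offset by limit for each page.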
Limit result groups
To limit results to groups of objects with similar scores, use the autocut filter to set the number of groups to return.
- Python
- JS/TS
- Go
- Java
- GraphQL
jeopardy = client.collections.use("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="safety",
    auto_limit=1
)

for o in response.objects:
    print(o.properties)
const jeopardy = client.collections.use('JeopardyQuestion');
const result = await jeopardy.query.bm25('safety', {
autoLimit: 1,
})
for (let object of result.objects) {
console.log(JSON.stringify(object.properties, null, 2));
}
ctx := context.Background()
className := "JeopardyQuestion"
query := (&graphql.BM25ArgumentBuilder{}).WithQuery("safety")
autoLimit := int(1)
result, err := client.GraphQL().Get().
WithClassName(className).
WithFields(
graphql.Field{Name: "question"},
graphql.Field{Name: "answer"},
).
WithBM25(query).
WithAutocut(autoLimit).
Do(ctx)
Bm25Argument keyword = Bm25Argument.builder()
.query("safety")
.build();
Fields fields = Fields.builder()
.fields(new Field[]{
Field.builder().name("question").build(),
Field.builder().name("answer").build(),
})
.build();
String query = GetBuilder.builder()
.className("JeopardyQuestion")
.fields(fields)
.withBm25Filter(keyword)
.autocut(1)
.build()
.buildQuery();
Result<GraphQLResponse> result = client.graphQL().raw().withQuery(query).run();
{
Get {
JeopardyQuestion(
bm25: {
query: "safety"
}
autocut: 1
) {
question
answer
_additional {
score
}
}
}
}
Example response
The response is like this:
{
"data": {
"Get": {
"JeopardyQuestion": [
{
"_additional": {
"score": "2.6768136"
},
"answer": "OSHA (Occupational Safety and Health Administration)",
"question": "The government admin. was created in 1971 to ensure occupational health & safety standards"
}
]
}
}
}
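The idea behind cutting at a group boundary can be sketched as follows. This is a deliberately simplified illustration and not Weaviate's autocut algorithm: it treats any gap between consecutive scores larger than `min_jump` as a group boundary, and both the `autocut` function and the `min_jump` threshold are assumptions introduced here.

```python
def autocut(scores, groups, min_jump):
    """Return the scores up to the `groups`-th large gap (scores sorted descending)."""
    jumps = 0
    for i in range(1, len(scores)):
        # A drop larger than `min_jump` marks the end of a group.
        if scores[i - 1] - scores[i] > min_jump:
            jumps += 1
            if jumps == groups:
                return scores[:i]
    # Fewer boundaries than requested groups: keep everything.
    return scores


scores = [2.9, 2.8, 2.7, 1.2, 1.1, 0.3]

print(autocut(scores, groups=1, min_jump=0.5))  # [2.9, 2.8, 2.7]
print(autocut(scores, groups=2, min_jump=0.5))  # [2.9, 2.8, 2.7, 1.2, 1.1]
```

Unlike limit, the number of returned objects is not fixed in advance; it depends on where the scores drop.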
Group results
v1.25
Define criteria to group search results.
- Python
- Java
from weaviate.classes.query import GroupBy

jeopardy = client.collections.use("JeopardyQuestion")

# Grouping parameters
group_by = GroupBy(
    prop="round",  # group by this property
    objects_per_group=3,  # maximum objects per group
    number_of_groups=2,  # maximum number of groups
)

# Query
response = jeopardy.query.bm25(
    query="California",
    group_by=group_by
)

for grp_name, grp_content in response.groups.items():
    print(grp_name, grp_content.objects)
Bm25Argument keyword = Bm25Argument.builder()
.query("California")
.build();
Fields fields = Fields.builder()
.fields(new Field[]{
Field.builder().name("round").build()
})
.build();
GroupByArgument groupBy = GroupByArgument.builder()
.path(new String[]{ "round" })
.groups(2)
.objectsPerGroup(3)
.build();
String query = GetBuilder.builder()
.className("JeopardyQuestion")
.fields(fields)
.withBm25Filter(keyword)
.withGroupByArgument(groupBy)
.build()
.buildQuery();
Result<GraphQLResponse> result = client.graphQL().raw().withQuery(query).run();
Example response
The response is like this:
'Jeopardy!'
'Double Jeopardy!'
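The two grouping limits interact as caps on a ranked list: one on how many groups are kept and one on how many objects each group holds. The sketch below reproduces that behavior on plain dictionaries. `group_results` is a hypothetical helper, and the assumption that groups are created in ranking order is made for illustration only.

```python
def group_results(ranked, prop, objects_per_group, number_of_groups):
    """Bucket ranked objects by `prop`, capping group count and group size."""
    groups = {}
    for obj in ranked:
        key = obj[prop]
        if key not in groups:
            if len(groups) == number_of_groups:
                # Group limit reached: ignore objects from new groups.
                continue
            groups[key] = []
        if len(groups[key]) < objects_per_group:
            groups[key].append(obj)
    return groups


ranked = [
    {"round": "Jeopardy!", "id": 1},
    {"round": "Double Jeopardy!", "id": 2},
    {"round": "Jeopardy!", "id": 3},
    {"round": "Final Jeopardy!", "id": 4},
    {"round": "Jeopardy!", "id": 5},
]

for name, objects in group_results(ranked, "round", objects_per_group=2, number_of_groups=2).items():
    print(name, [o["id"] for o in objects])
```

Here the third round is dropped because only two groups are allowed, and the last "Jeopardy!" object is dropped because that group is already full.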
Filter results
For more specific results, use a filter to narrow your search.
- Python
- JS/TS
- Go
- Java
- GraphQL
from weaviate.classes.query import Filter

jeopardy = client.collections.use("JeopardyQuestion")
response = jeopardy.query.bm25(
    query="food",
    filters=Filter.by_property("round").equal("Double Jeopardy!"),
    return_properties=["answer", "question", "round"],  # return these properties
    limit=3
)

for o in response.objects:
    print(o.properties)
const jeopardy = client.collections.use('JeopardyQuestion');
const result = await jeopardy.query.bm25('food', {
limit: 3,
returnMetadata: ['score'],
filters: jeopardy.filter.byProperty('round').equal('Double Jeopardy!'),
returnProperties: ['question', 'answer', 'round'],
})
for (let object of result.objects) {
console.log(JSON.stringify(object.properties, null, 2));
}
ctx := context.Background()
className := "JeopardyQuestion"
query := (&graphql.BM25ArgumentBuilder{}).WithQuery("food")
limit := int(3)
filter := filters.Where().
WithPath([]string{"round"}).
WithOperator(filters.Equal).
WithValueString("Double Jeopardy!")
result, err := client.GraphQL().Get().
WithClassName(className).
WithFields(
graphql.Field{Name: "answer"},
graphql.Field{Name: "question"},
graphql.Field{Name: "round"},
).
WithBM25(query).
WithWhere(filter).
WithLimit(limit).
Do(ctx)
Bm25Argument keyword = Bm25Argument.builder()
.query("food")
.build();
Fields fields = Fields.builder()
.fields(new Field[]{
Field.builder().name("question").build(),
Field.builder().name("answer").build(),
Field.builder().name("round").build(),
Field.builder().name("_additional").fields(new Field[]{
Field.builder().name("score").build()
}).build()
})
.build();
WhereFilter whereFilter = WhereFilter.builder()
.path(new String[]{ "round" }) // Path to filter by
.operator(Operator.Equal)
.valueText("Double Jeopardy!")
.build();
WhereArgument whereArgument = WhereArgument.builder()
.filter(whereFilter)
.build();
String query = GetBuilder.builder()
.className("JeopardyQuestion")
.fields(fields)
.withBm25Filter(keyword)
.withWhereFilter(whereArgument)
.limit(3)
.build()
.buildQuery();
Result<GraphQLResponse> result = client.graphQL().raw().withQuery(query).run();
{
Get {
JeopardyQuestion(
limit: 3
bm25: {
query: "food"
}
where: {
path: ["round"]
operator: Equal
valueText: "Double Jeopardy!"
}
) {
question
answer
round
_additional {
score
}
}
}
}
Example response
The response is like this:
{
"data": {
"Get": {
"JeopardyQuestion": [
{
"_additional": {
"score": "3.0140665"
},
"answer": "food stores (supermarkets)",
"question": "This type of retail store sells more shampoo & makeup than any other",
"round": "Double Jeopardy!"
},
{
"_additional": {
"score": "1.9633813"
},
"answer": "honey",
"question": "The primary source of this food is the Apis mellifera",
"round": "Double Jeopardy!"
},
{
"_additional": {
"score": "1.6719631"
},
"answer": "pseudopods",
"question": "Amoebas use temporary extensions called these to move or to surround & engulf food",
"round": "Double Jeopardy!"
}
]
}
}
}
Tokenization
Weaviate converts filter terms into tokens. The default tokenization is word. The word tokenizer keeps alphanumeric characters, lowercases them, and splits the text on whitespace and on any other non-alphanumeric character. It converts a string like "Test_domain_weaviate" into "test", "domain", and "weaviate".
For details and additional tokenization methods, see Tokenization.
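The word tokenization described above can be approximated in a few lines. This is a minimal sketch, assuming ASCII input; `word_tokenize` is a hypothetical helper and not part of any Weaviate client.

```python
import re


def word_tokenize(text):
    """Lowercase the text and keep runs of ASCII letters and digits as tokens."""
    # Python's \w would also match "_", so the character class is spelled out.
    return re.findall(r"[a-z0-9]+", text.lower())


print(word_tokenize("Test_domain_weaviate"))  # ['test', 'domain', 'weaviate']
print(word_tokenize("Double Jeopardy!"))      # ['double', 'jeopardy']
```

The second call shows why a filter value such as "Double Jeopardy!" matches regardless of case or the trailing punctuation.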
Fuzzy matching
You can enable fuzzy matching and typo tolerance in BM25 searches by using trigram
tokenization. This technique breaks text into overlapping 3-character sequences, allowing BM25 to find matches even when there are spelling errors or variations.
This enables matching between similar but not identical strings because they share trigrams. For example, "Morgn" and "Morgan" share the trigrams "mor" and "org" (ignoring case).
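The overlap between two strings can be checked with a short sketch. The `trigrams` and `trigram_overlap` helpers below are hypothetical, and the Jaccard ratio is used only to make the overlap visible; BM25 itself scores the matching trigram tokens rather than computing this ratio.

```python
def trigrams(text):
    """Return the set of overlapping 3-character sequences in the lowercased text."""
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}


def trigram_overlap(a, b):
    """Jaccard overlap of two trigram sets (illustrative similarity measure)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


print(sorted(trigrams("Morgn")))                      # ['mor', 'org', 'rgn']
print(sorted(trigrams("Morgan")))                     # ['gan', 'mor', 'org', 'rga']
print(sorted(trigrams("Morgn") & trigrams("Morgan")))  # ['mor', 'org']
print(trigram_overlap("Morgn", "Morgan"))             # 0.4
```

A misspelled query therefore still produces some tokens that exist in the index, which is what lets the search return the intended object.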
Set the tokenization method to trigram at the property level when creating your collection:
- Python
- JS/TS
from weaviate.classes.config import Configure, Property, DataType, Tokenization
client.collections.create(
"Article",
vector_config=Configure.Vectors.text2vec_cohere(),
properties=[
Property(
name="title",
data_type=DataType.TEXT,
tokenization=Tokenization.TRIGRAM, # Use "trigram" tokenization
),
],
)
import { vectors, dataType, tokenization } from 'weaviate-client';
const newCollection = await client.collections.create({
name: 'Article',
vectorizers: vectors.text2VecHuggingFace(),
properties: [
{
name: 'title',
dataType: dataType.TEXT,
tokenization: tokenization.TRIGRAM // Use "trigram" tokenization
},
],
})
- Use trigram tokenization selectively on fields that need fuzzy matching. Filtering behavior will change significantly, because text filtering is then based on trigram-tokenized text instead of whole words.
- Keep exact-match fields with word or field tokenization for precision.
Further resources
- Connect to Weaviate
- API References: Search operators # BM25
- Reference: Tokenization options
- Weaviate Academy: Tokenization
Questions and feedback
If you have any questions or feedback, let us know in the user forum.