Inverted index
An inverted index is a data structure in Weaviate that enables efficient text search and filtering operations.
Additional information
In Weaviate, the inverted index supports search capabilities such as keyword search, filtering, and range queries. An inverted index maps from terms (tokens) back to the objects that contain them. This mapping allows Weaviate to quickly identify which objects contain specific terms or match certain criteria during search queries.
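The term-to-object mapping described above can be illustrated with a few lines of plain Python. This is a conceptual sketch only, not Weaviate's internal implementation; the helper names `build_inverted_index` and `keyword_lookup` and the sample data are hypothetical, and tokenization is simplified to lowercasing and splitting on whitespace.

```python
from collections import defaultdict


def build_inverted_index(objects):
    """Map each lowercase token to the set of object IDs that contain it."""
    index = defaultdict(set)
    for object_id, text in objects.items():
        for token in text.lower().split():
            index[token].add(object_id)
    return index


def keyword_lookup(index, query):
    """Return IDs of objects containing every token in the query."""
    postings = [index.get(token, set()) for token in query.lower().split()]
    return set.intersection(*postings) if postings else set()


# Hypothetical sample data: object ID -> text property value
articles = {
    1: "vector search with weaviate",
    2: "keyword search and filtering",
    3: "range filtering on numbers",
}
index = build_inverted_index(articles)
print(sorted(keyword_lookup(index, "search")))            # [1, 2]
print(sorted(keyword_lookup(index, "search filtering")))  # [2]
```

Because the lookup starts from the term rather than from each object, a query only touches the posting sets for its own tokens instead of scanning every object.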
You can enable inverted indexes on properties and adjust various parameters that control indexing behavior and tokenization strategies. Proper configuration of these parameters is crucial for optimizing both search performance and storage efficiency.
Enable inverted index for keyword searches and filtering
Inverted index parameters control how individual properties are indexed for search and filtering operations. These parameters determine whether specific properties can be searched, filtered, or used in range queries.
Enabling inverted index
The inverted index in Weaviate can be enabled through parameters at the property level:
index_filterable
- Controls whether a property can be used in where filters. When set to true, the property values are indexed for efficient filtering operations. Disable this for properties that don't need filtering to save storage space.
index_searchable
- Determines whether a property participates in keyword search queries. When true, the property's text content is tokenized and indexed for search. Set to false for properties that shouldn't be searchable to improve performance.
index_range_filters
- Enables range filtering capabilities (greater than, less than, etc.) for numerical and date properties. When enabled, additional indexing structures are created to support efficient range queries.
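To see why a dedicated structure helps range queries, consider the following toy example. It is an illustration under stated assumptions, not how Weaviate stores range indexes internally: the class `SortedRangeIndex` and the sample values are hypothetical, and the sketch simply keeps values sorted so that a range lookup becomes two binary searches instead of a full scan.

```python
from bisect import bisect_left, bisect_right


class SortedRangeIndex:
    """Toy range index: (value, object_id) pairs kept in sorted order."""

    def __init__(self, values_by_id):
        self._entries = sorted(
            (value, object_id) for object_id, value in values_by_id.items()
        )
        self._values = [value for value, _ in self._entries]

    def between(self, low, high):
        """Return IDs whose value satisfies low <= value <= high."""
        start = bisect_left(self._values, low)
        end = bisect_right(self._values, high)
        return [object_id for _, object_id in self._entries[start:end]]


# Hypothetical chunk numbers keyed by object ID
chunk_numbers = {"a": 4, "b": 1, "c": 9, "d": 6}
range_index = SortedRangeIndex(chunk_numbers)
print(range_index.between(4, 8))  # ['a', 'd']
```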
The examples below show this configuration in Python, JS/TS, Java, and Go, in that order.
from weaviate.classes.config import Configure, Property, DataType
client.collections.create(
"Article",
# Additional settings not shown
properties=[
Property(
name="title",
data_type=DataType.TEXT,
index_filterable=True,
index_searchable=True,
),
Property(
name="chunk",
data_type=DataType.TEXT,
index_filterable=True,
index_searchable=True,
),
Property(
name="chunk_number",
data_type=DataType.INT,
index_range_filters=True,
),
],
)
import { dataType } from 'weaviate-client';
await client.collections.create({
name: 'Article',
properties: [
{
name: 'title',
dataType: dataType.TEXT,
indexFilterable: true,
indexSearchable: true,
},
{
name: 'chunk',
dataType: dataType.TEXT,
indexFilterable: true,
indexSearchable: true,
},
{
name: 'chunk_no',
dataType: dataType.INT,
indexRangeFilters: true,
},
],
})
// Create properties with specific indexing configurations
Property titleProperty = Property.builder()
.name("title")
.dataType(Arrays.asList(DataType.TEXT))
.indexFilterable(true)
.indexSearchable(true)
.build();
Property chunkProperty = Property.builder()
.name("chunk_no")
.dataType(Arrays.asList(DataType.INT))
.indexRangeFilters(true)
.build();
// Configure BM25 settings
BM25Config bm25Config = BM25Config.builder()
.b(0.7f)
.k1(1.25f)
.build();
// Configure inverted index with BM25 and other settings
InvertedIndexConfig invertedIndexConfig = InvertedIndexConfig.builder()
.bm25(bm25Config)
.indexNullState(true)
.indexPropertyLength(true)
.indexTimestamps(true)
.build();
// Create the Article collection with properties and inverted index configuration
WeaviateClass articleCollection = WeaviateClass.builder()
.className(collectionName)
.properties(Arrays.asList(titleProperty, chunkProperty))
.invertedIndexConfig(invertedIndexConfig)
.build();
// Add the collection to the schema
Result<Boolean> result = client.schema().classCreator()
.withClass(articleCollection)
.run();
vTrue := true
vFalse := false
articleClass := &models.Class{
Class: "Article",
Description: "Collection of articles",
Properties: []*models.Property{
{
Name: "title",
DataType: schema.DataTypeText.PropString(),
Tokenization: "lowercase",
IndexFilterable: &vTrue,
IndexSearchable: &vFalse,
},
{
Name: "chunk",
DataType: schema.DataTypeText.PropString(),
Tokenization: "word",
IndexFilterable: &vTrue,
IndexSearchable: &vTrue,
},
{
Name: "chunk_no",
DataType: schema.DataTypeInt.PropString(),
IndexRangeFilters: &vTrue,
},
},
InvertedIndexConfig: &models.InvertedIndexConfig{
Bm25: &models.BM25Config{
B: 0.7,
K1: 1.25,
},
IndexNullState: true,
IndexPropertyLength: true,
IndexTimestamps: true,
},
}
Set inverted index parameters
Inverted index parameters control the overall behavior of the inverted index for an entire collection. These parameters affect ranking algorithms, null value handling, and timestamp indexing across all properties in the collection.
Inverted index parameters
The inverted index in Weaviate can be configured through various parameters at the collection level:
bm25_b
- Controls the degree of normalization by document length in the BM25 ranking algorithm. Values range from 0 to 1, where 0 means no length normalization and 1 means full normalization. Higher values favor shorter documents.
bm25_k1
- Controls term frequency saturation in BM25. Higher values make term frequency more important, while lower values reduce the impact of term frequency on scoring.
index_null_state
- Determines whether null values are indexed. When enabled, you can filter for objects that have null values in specific properties.
index_property_length
- Controls whether the length of property values is indexed. When enabled, you can filter objects based on the length of a property.
index_timestamps
- Enables indexing of creation and update timestamps for objects, allowing filtering and sorting operations.
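The effect of the two BM25 parameters is easier to see with numbers. The sketch below uses the textbook BM25 term-score formula; the function `bm25_term_score`, its default values, and the sample inputs are illustrative assumptions, and Weaviate's exact scoring code may differ in detail.

```python
def bm25_term_score(tf, doc_len, avg_doc_len, idf, k1=1.2, b=0.75):
    """Score one query term for one document with the textbook BM25 formula."""
    # b scales how strongly the document length adjusts the score
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    # k1 controls how quickly repeated occurrences stop adding to the score
    return idf * (tf * (k1 + 1)) / (tf + k1 * length_norm)


# Same term frequency, different document lengths (hypothetical values)
short_doc = bm25_term_score(tf=2, doc_len=50, avg_doc_len=100, idf=1.0, k1=1.25, b=0.7)
long_doc = bm25_term_score(tf=2, doc_len=200, avg_doc_len=100, idf=1.0, k1=1.25, b=0.7)
print(short_doc > long_doc)  # True

# With b=0, document length no longer influences the score
flat_short = bm25_term_score(tf=2, doc_len=50, avg_doc_len=100, idf=1.0, k1=1.25, b=0)
flat_long = bm25_term_score(tf=2, doc_len=200, avg_doc_len=100, idf=1.0, k1=1.25, b=0)
print(flat_short == flat_long)  # True
```

In this formula the score for a single term can never exceed `idf * (k1 + 1)`, which is the saturation ceiling that `k1` raises or lowers.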
The examples below show this configuration in Python, JS/TS, Java, and Go, in that order.
from weaviate.classes.config import Configure, Property, DataType
client.collections.create(
"Article",
# Additional settings not shown
inverted_index_config=Configure.inverted_index(
bm25_b=0.7,
bm25_k1=1.25,
index_null_state=True,
index_property_length=True,
index_timestamps=True,
),
)
import { dataType } from 'weaviate-client';
await client.collections.create({
name: 'Article',
invertedIndex: {
bm25: {
b: 0.7,
k1: 1.25
},
indexNullState: true,
indexPropertyLength: true,
indexTimestamps: true
}
})
// Create properties with specific indexing configurations
Property titleProperty = Property.builder()
.name("title")
.dataType(Arrays.asList(DataType.TEXT))
.indexFilterable(true)
.indexSearchable(true)
.build();
Property chunkProperty = Property.builder()
.name("chunk_no")
.dataType(Arrays.asList(DataType.INT))
.indexRangeFilters(true)
.build();
// Configure BM25 settings
BM25Config bm25Config = BM25Config.builder()
.b(0.7f)
.k1(1.25f)
.build();
// Configure inverted index with BM25 and other settings
InvertedIndexConfig invertedIndexConfig = InvertedIndexConfig.builder()
.bm25(bm25Config)
.indexNullState(true)
.indexPropertyLength(true)
.indexTimestamps(true)
.build();
// Create the Article collection with properties and inverted index configuration
WeaviateClass articleCollection = WeaviateClass.builder()
.className(collectionName)
.properties(Arrays.asList(titleProperty, chunkProperty))
.invertedIndexConfig(invertedIndexConfig)
.build();
// Add the collection to the schema
Result<Boolean> result = client.schema().classCreator()
.withClass(articleCollection)
.run();
vTrue := true
vFalse := false
articleClass := &models.Class{
Class: "Article",
Description: "Collection of articles",
Properties: []*models.Property{
{
Name: "title",
DataType: schema.DataTypeText.PropString(),
Tokenization: "lowercase",
IndexFilterable: &vTrue,
IndexSearchable: &vFalse,
},
{
Name: "chunk",
DataType: schema.DataTypeText.PropString(),
Tokenization: "word",
IndexFilterable: &vTrue,
IndexSearchable: &vTrue,
},
{
Name: "chunk_no",
DataType: schema.DataTypeInt.PropString(),
IndexRangeFilters: &vTrue,
},
},
InvertedIndexConfig: &models.InvertedIndexConfig{
Bm25: &models.BM25Config{
B: 0.7,
K1: 1.25,
},
IndexNullState: true,
IndexPropertyLength: true,
IndexTimestamps: true,
},
}
Set tokenization type for property
Configure a tokenization method for each property individually.
Tokenization methods
Tokenization determines how text content is broken down into individual terms that can be indexed and searched. Weaviate supports several tokenization strategies:
word
- The default tokenization that splits text on whitespace and punctuation, converting to lowercase. Best for general text search where you want to match individual words.
lowercase
- Converts the text to lowercase and splits it on whitespace, preserving punctuation and symbols. Useful for case-insensitive matching when symbols such as @, #, or hyphens are meaningful.
whitespace
- Splits text only on whitespace characters, preserving punctuation and case. Good when punctuation is meaningful for search.
field
- Treats the entire property value as a single token, after trimming leading and trailing whitespace. Use for exact matching of complete field values like IDs, email addresses, or URLs.
trigram
- Breaks text into overlapping 3-character sequences. Enables fuzzy matching and is useful for handling typos or partial matches.
gse
- Uses the gse text segmentation library to split Chinese and Japanese text into words. Provides dictionary-based segmentation for languages that do not separate words with whitespace.
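The difference between the simpler methods is clearest on a string that mixes case and punctuation. The following is a rough approximation in plain Python, not Weaviate's tokenizer code: the `tokenize` helper is hypothetical, it covers only `word`, `lowercase`, `whitespace`, and `field`, and it assumes that `lowercase` splits on whitespace after lowercasing and that `field` trims surrounding whitespace.

```python
import re


def tokenize(text, method):
    """Approximate a few tokenization methods on a single string."""
    if method == "word":
        # Lowercase, then keep only runs of letters and digits
        return re.findall(r"[^\W_]+", text.lower())
    if method == "lowercase":
        # Lowercase, then split on whitespace; punctuation is kept
        return text.lower().split()
    if method == "whitespace":
        # Split on whitespace only; case and punctuation are kept
        return text.split()
    if method == "field":
        # Whole value as one token, surrounding whitespace trimmed
        return [text.strip()]
    raise ValueError(f"unsupported method: {method}")


sample = "Hello, (beautiful) world"
for method in ("word", "lowercase", "whitespace", "field"):
    print(method, tokenize(sample, method))
# word ['hello', 'beautiful', 'world']
# lowercase ['hello,', '(beautiful)', 'world']
# whitespace ['Hello,', '(beautiful)', 'world']
# field ['Hello, (beautiful) world']
```

The same tokenization is applied to the query, so a search for "hello" matches this value under `word` but not under `whitespace`, where the stored token is "Hello,".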
The examples below show this configuration in Python, JS/TS, Java, and Go, in that order.
from weaviate.classes.config import Configure, Property, DataType, Tokenization
client.collections.create(
"Article",
vector_config=Configure.Vectors.text2vec_cohere(),
properties=[
Property(
name="title",
data_type=DataType.TEXT,
tokenization=Tokenization.LOWERCASE, # Use "lowercase" tokenization
description="The title of the article.", # Optional description
),
Property(
name="body",
data_type=DataType.TEXT,
tokenization=Tokenization.WHITESPACE, # Use "whitespace" tokenization
),
],
)
const newCollection = await client.collections.create({
name: 'Article',
vectorizers: vectors.text2VecHuggingFace(),
properties: [
{
name: 'title',
dataType: dataType.TEXT,
tokenization: tokenization.LOWERCASE
},
{
name: 'body',
dataType: dataType.TEXT,
tokenization: tokenization.WHITESPACE
},
],
})
Property titleProperty = Property.builder()
.name("title")
.description("title of the article")
.dataType(Arrays.asList(DataType.TEXT))
.tokenization(Tokenization.WORD)
.build();
Property bodyProperty = Property.builder()
.name("body")
.description("body of the article")
.dataType(Arrays.asList(DataType.TEXT))
.tokenization(Tokenization.LOWERCASE)
.build();
// Add the defined properties to the collection
WeaviateClass articleCollection = WeaviateClass.builder()
.className(collectionName)
.description("Article collection Description...")
.properties(Arrays.asList(titleProperty, bodyProperty))
.build();
Result<Boolean> result = client.schema().classCreator()
.withClass(articleCollection)
.run();
vTrue := true
vFalse := false
articleClass := &models.Class{
Class: "Article",
Description: "Collection of articles",
Properties: []*models.Property{
{
Name: "title",
DataType: schema.DataTypeText.PropString(),
Tokenization: "lowercase",
IndexFilterable: &vTrue,
IndexSearchable: &vFalse,
ModuleConfig: map[string]interface{}{
"text2vec-cohere": map[string]interface{}{
"vectorizePropertyName": true,
},
},
},
{
Name: "body",
DataType: schema.DataTypeText.PropString(),
Tokenization: "whitespace",
IndexFilterable: &vTrue,
IndexSearchable: &vTrue,
ModuleConfig: map[string]interface{}{
"text2vec-cohere": map[string]interface{}{
"vectorizePropertyName": false,
},
},
},
},
Vectorizer: "text2vec-cohere",
}