Inverted index
The inverted index maps values (like words or numbers) to the objects that contain them. It is the backbone for all attribute-based filtering (where filters) and keyword searching (bm25, hybrid).
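To make the mapping concrete, here is a minimal sketch of an inverted index in plain Python. It is a toy model for illustration only, not how Weaviate stores its indexes; the helper name `build_inverted_index` and the whitespace tokenization are assumptions.

```python
from collections import defaultdict

def build_inverted_index(objects, prop):
    """Map each lowercase word found in `prop` to the set of object IDs containing it."""
    index = defaultdict(set)
    for obj_id, obj in objects.items():
        for token in obj[prop].lower().split():
            index[token].add(obj_id)
    return index

objects = {
    1: {"title": "Roaring bitmaps explained"},
    2: {"title": "Inverted index basics"},
    3: {"title": "Bitmaps and the inverted index"},
}

title_index = build_inverted_index(objects, "title")
print(sorted(title_index["inverted"]))  # [2, 3]
print(sorted(title_index["bitmaps"]))   # [1, 3]
```

A lookup by value is then a single dictionary access, which is why filters and keyword searches do not need to scan every object.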
Inverted index types
Multiple inverted index types are available in Weaviate. Not all inverted index types are available for all data types. The available inverted index types are:
Inverted index type | Description | Applicable data types | Default | Availability |
---|---|---|---|---|
indexSearchable | A searchable, BM25-suitable Map index for BM25 or hybrid searching. | text, text[] | true | v1.19 |
indexFilterable | A Roaring Bitmap index for match-based filtering. | All data types except blob, geoCoordinates, object and phoneNumber (including arrays of these types) | true | v1.19 |
indexRangeFilters | A Roaring Bitmap index for numerical range-based filtering. | int, number and date only | false | v1.26 |
- Enable one or both of indexFilterable and indexRangeFilters to index a property for faster filtering.
  - If only one is enabled, the respective index is used for filtering.
  - If both are enabled, indexRangeFilters is used for operations involving comparison operators, and indexFilterable is used for equality and inequality operations.
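The selection rule above can be written out as a small function. This is only an illustration of the rule as stated, not Weaviate's query planner; `pick_index` is a hypothetical name, and returning `None` when neither index is enabled is an assumption made for the sketch.

```python
# Filter operators that compare against a range boundary.
RANGE_OPERATORS = {"GreaterThan", "GreaterThanEqual", "LessThan", "LessThanEqual"}

def pick_index(operator, index_filterable, index_range_filters):
    """Return the name of the index that would serve a filter with this operator."""
    if not (index_filterable or index_range_filters):
        return None  # assumption: no inverted index available for filtering
    if index_filterable and index_range_filters:
        # Both enabled: comparisons use the range index, (in)equality the filterable one.
        return "indexRangeFilters" if operator in RANGE_OPERATORS else "indexFilterable"
    # Only one enabled: that index is used regardless of operator.
    return "indexFilterable" if index_filterable else "indexRangeFilters"

print(pick_index("GreaterThan", True, True))   # indexRangeFilters
print(pick_index("Equal", True, True))         # indexFilterable
print(pick_index("GreaterThan", True, False))  # indexFilterable
```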
Inverted index parameters
These parameters are set within the invertedIndexConfig object in your collection definition.
Parameter | Type | Default | Details |
---|---|---|---|
bm25 | Object | { "k1": 1.2, "b": 0.75 } | Sets the k1 and b parameters for the BM25 ranking algorithm. Can be overridden at the property level. See BM25 Configuration below. |
stopwords | Object | (Varies) | Defines the stopword list to exclude common words from search queries. See Stopwords Configuration below. |
indexTimestamps | Boolean | false | If true, indexes object creation and update timestamps, enabling filtering by creationTimeUnix and lastUpdateTimeUnix. |
indexNullState | Boolean | false | If true, indexes the null/non-null state of each property, enabling filtering for null values. |
indexPropertyLength | Boolean | false | If true, indexes the length of each property, enabling filtering by property length. |
Enabling indexTimestamps, indexNullState, or indexPropertyLength adds overhead as these additional indexes must be created and maintained. Only enable them if you require these specific filtering capabilities.
Code example
This code example shows how to configure inverted index parameters through a client library:
- Python Client v4
- JS/TS Client v3
- Java
- Go
from weaviate.classes.config import Configure, Property, DataType

client.collections.create(
    "Article",
    # Additional settings not shown
    properties=[  # properties configuration is optional
        Property(
            name="title",
            data_type=DataType.TEXT,
            index_filterable=True,
            index_searchable=True,
        ),
        Property(
            name="chunk",
            data_type=DataType.TEXT,
            index_filterable=True,
            index_searchable=True,
        ),
        Property(
            name="chunk_number",
            data_type=DataType.INT,
            index_range_filters=True,
        ),
    ],
    inverted_index_config=Configure.inverted_index(  # Optional
        bm25_b=0.7,
        bm25_k1=1.25,
        index_null_state=True,
        index_property_length=True,
        index_timestamps=True,
        stopwords_preset="en",
        stopwords_additions=["example", "stopword"],
        stopwords_removals=["the", "and"],
    ),
)
import { dataType } from 'weaviate-client';

await client.collections.create({
  name: 'Article',
  properties: [
    {
      name: 'title',
      dataType: dataType.TEXT,
      indexFilterable: true,
      indexSearchable: true,
    },
    {
      name: 'chunk',
      dataType: dataType.TEXT,
      indexFilterable: true,
      indexSearchable: true,
    },
    {
      name: 'chunk_no',
      dataType: dataType.INT,
      indexRangeFilters: true,
    },
  ],
  invertedIndex: {
    bm25: {
      b: 0.7,
      k1: 1.25,
    },
    indexNullState: true,
    indexPropertyLength: true,
    indexTimestamps: true,
  },
});
// Create properties with specific indexing configurations
Property titleProperty = Property.builder()
    .name("title")
    .dataType(Arrays.asList(DataType.TEXT))
    .indexFilterable(true)
    .indexSearchable(true)
    .build();

// Integer property indexed for range filters (named chunk_no, as in the other examples)
Property chunkNoProperty = Property.builder()
    .name("chunk_no")
    .dataType(Arrays.asList(DataType.INT))
    .indexRangeFilters(true)
    .build();

// Configure BM25 settings
BM25Config bm25Config = BM25Config.builder()
    .b(0.7f)
    .k1(1.25f)
    .build();

// Configure inverted index with BM25 and other settings
InvertedIndexConfig invertedIndexConfig = InvertedIndexConfig.builder()
    .bm25(bm25Config)
    .indexNullState(true)
    .indexPropertyLength(true)
    .indexTimestamps(true)
    .build();

// Create the Article collection with properties and inverted index configuration
WeaviateClass articleCollection = WeaviateClass.builder()
    .className(collectionName)
    .properties(Arrays.asList(titleProperty, chunkNoProperty))
    .invertedIndexConfig(invertedIndexConfig)
    .build();

// Add the collection to the schema
Result<Boolean> result = client.schema().classCreator()
    .withClass(articleCollection)
    .run();
vTrue := true
vFalse := false

articleClass := &models.Class{
    Class:       "Article",
    Description: "Collection of articles",
    Properties: []*models.Property{
        {
            Name:            "title",
            DataType:        schema.DataTypeText.PropString(),
            Tokenization:    "lowercase",
            IndexFilterable: &vTrue,
            IndexSearchable: &vFalse,
        },
        {
            Name:            "chunk",
            DataType:        schema.DataTypeText.PropString(),
            Tokenization:    "word",
            IndexFilterable: &vTrue,
            IndexSearchable: &vTrue,
        },
        {
            Name:              "chunk_no",
            DataType:          schema.DataTypeInt.PropString(),
            IndexRangeFilters: &vTrue,
        },
    },
    InvertedIndexConfig: &models.InvertedIndexConfig{
        Bm25: &models.BM25Config{
            B:  0.7,
            K1: 1.25,
        },
        IndexNullState:      true,
        IndexPropertyLength: true,
        IndexTimestamps:     true,
    },
}
bm25
Part of invertedIndexConfig. The settings for BM25 are the free parameters k1 and b, and they are optional. The defaults (k1 = 1.2 and b = 0.75) work well for most cases.
They can be configured per collection, and can optionally be overridden per property.
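To see what k1 and b control, the sketch below implements the commonly cited Okapi BM25 per-term score in plain Python. It is a textbook formulation under stated assumptions, not necessarily the exact computation Weaviate performs; the function name and argument names are illustrative.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Textbook BM25 score of one query term for one document.

    tf: term frequency in the document; doc_freq: number of documents with the term.
    k1 controls how quickly repeated occurrences saturate.
    b controls how strongly long documents are penalized (0 = no length normalization).
    """
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    return idf * (tf * (k1 + 1)) / (tf + k1 * length_norm)

# Same term frequency, short vs. long document, 1000 docs, term appears in 10 of them.
short_doc = bm25_term_score(tf=2, doc_len=50, avg_doc_len=100, n_docs=1000, doc_freq=10)
long_doc = bm25_term_score(tf=2, doc_len=200, avg_doc_len=100, n_docs=1000, doc_freq=10)
print(short_doc > long_doc)  # True: with b > 0, the shorter document scores higher

# With b = 0 the document length no longer matters.
no_norm_short = bm25_term_score(2, 50, 100, 1000, 10, b=0.0)
no_norm_long = bm25_term_score(2, 200, 100, 1000, 10, b=0.0)
print(math.isclose(no_norm_short, no_norm_long))  # True
```

Raising k1 lets additional occurrences of a term keep adding to the score for longer; lowering b reduces the advantage of short documents.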
Example bm25 configuration - JSON object
An example of a complete collection object with bm25 configuration:
{
  "class": "Article",
  // Configuration of the sparse index
  "invertedIndexConfig": {
    "bm25": {
      "b": 0.75,
      "k1": 1.2
    }
  },
  "properties": [
    {
      "name": "title",
      "description": "title of the article",
      "dataType": ["text"],
      // Property-level settings override the collection-level settings
      "invertedIndexConfig": {
        "bm25": {
          "b": 0.75,
          "k1": 1.2
        }
      },
      "indexFilterable": true,
      "indexSearchable": true
    }
  ]
}
stopwords
Part of invertedIndexConfig. text properties may contain words that are very common and don't contribute to search results. Ignoring them speeds up queries that contain stopwords, as they can be automatically removed from queries as well. This speedup is very notable on scored searches, such as BM25.
The stopword configuration uses a preset system. You can select a preset to use the most common stopwords for a particular language (e.g. the "en" preset). If you need more fine-grained control, you can add additional stopwords or remove stopwords that you believe should not be part of the list. Alternatively, you can create your custom stopword list by starting with an empty ("none") preset and adding all your desired stopwords as additions.
Example stopwords configuration - JSON object
An example of the invertedIndexConfig part of a collection object with stopwords configuration:
"invertedIndexConfig": {
  "stopwords": {
    "preset": "en",
    "additions": ["star", "nebula"],
    "removals": ["a", "the"]
  }
}
This configuration allows stopwords to be configured by collection. If not set, these values are set to the following defaults:
Parameter | Default value | Acceptable values |
---|---|---|
"preset" | "en" | "en", "none" |
"additions" | [] | any list of custom words |
"removals" | [] | any list of custom words |
- If preset is none, then the collection only uses stopwords from the additions list.
- If the same item is included in both additions and removals, Weaviate returns an error.
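The way preset, additions and removals combine can be sketched in a few lines of Python. This is an illustrative model of the rules described above, not Weaviate code: the `PRESETS` table holds only a tiny stand-in for the real "en" list, and the function names are hypothetical.

```python
# Tiny stand-in for the presets; the real "en" preset contains more words.
PRESETS = {
    "en": {"a", "an", "and", "of", "the"},
    "none": set(),
}

def effective_stopwords(preset="en", additions=(), removals=()):
    """Combine a preset with additions and removals, rejecting conflicting entries."""
    additions, removals = set(additions), set(removals)
    overlap = additions & removals
    if overlap:
        raise ValueError(f"words in both additions and removals: {sorted(overlap)}")
    return (PRESETS[preset] | additions) - removals

def strip_stopwords(query, stopwords):
    """Drop stopwords from a whitespace-tokenized, lowercased query."""
    return [token for token in query.lower().split() if token not in stopwords]

stopwords = effective_stopwords("en", additions=["star", "nebula"], removals=["a", "the"])
print(sorted(stopwords))  # ['an', 'and', 'nebula', 'of', 'star']
print(strip_stopwords("The history of a distant nebula", stopwords))
# ['the', 'history', 'a', 'distant']
```

With the "none" preset, the same function returns exactly the additions list, which matches the first rule above.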
As of v1.18, stopwords are indexed. Thus stopwords are included in the inverted index, but not in the tokenized query. As a result, when the BM25 algorithm is applied, stopwords are ignored in the input for relevance ranking but will affect the score.
Stopwords can now be configured at runtime. You can use the RESTful API to update the list of stopwords after your data has been indexed.
Note that stopwords are only removed when tokenization is set to word.
indexTimestamps
Part of invertedIndexConfig. To perform queries that are filtered by timestamps, configure the target collection to maintain an inverted index based on the objects' internal timestamps. Currently the timestamps include creationTimeUnix and lastUpdateTimeUnix.
To configure timestamp-based indexing, set indexTimestamps to true in the invertedIndexConfig object.
indexNullState
Part of invertedIndexConfig. To perform queries that filter on null, configure the target collection to maintain an inverted index that tracks null values for each property in a collection.
To configure null-based indexing, set indexNullState to true in the invertedIndexConfig object.
indexPropertyLength
Part of invertedIndexConfig. To perform queries that filter by the length of a property, configure the target collection to maintain an inverted index based on the length of the properties.
To configure indexing based on property length, set indexPropertyLength to true in the invertedIndexConfig object.
Using these features requires more resources. The additional inverted indexes must be created and maintained for the lifetime of the collection.
How Weaviate creates inverted indexes
Weaviate creates separate inverted indexes for each property and each index type. For example, if you have a title property that is both searchable and filterable, Weaviate will create two separate inverted indexes for that property - one optimized for search operations and another for filtering operations.
Find out more in Concepts: Inverted index.
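As a rough illustration of "one index per property and index type", the sketch below lists which indexes a set of property definitions would imply. It only counts flags that are explicitly set to true and ignores defaults; `inverted_indexes_for` is a hypothetical helper, not part of any Weaviate client.

```python
INDEX_FLAGS = ("indexSearchable", "indexFilterable", "indexRangeFilters")

def inverted_indexes_for(properties):
    """Return one (property name, index type) pair per explicitly enabled index flag."""
    return [
        (prop["name"], flag)
        for prop in properties
        for flag in INDEX_FLAGS
        if prop.get(flag)
    ]

properties = [
    {"name": "title", "indexSearchable": True, "indexFilterable": True},
    {"name": "chunk_no", "indexRangeFilters": True},
]

for name, index_type in inverted_indexes_for(properties):
    print(name, index_type)
# title indexSearchable
# title indexFilterable
# chunk_no indexRangeFilters
```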
Adding a property after collection creation
Adding a property after importing objects can lead to limitations in inverted-index related behavior, such as filtering by the new property's length or null status.
This is caused by the inverted index being built at import time. If you add a property after importing objects, the inverted index for metadata such as the length or the null status will not be updated to include the new properties. This means that the new property will not be indexed for existing objects. This can lead to unexpected behavior when querying.
To avoid this, you can either:
- Add the property before importing objects.
- Delete the collection, re-create it with the new property and then re-import the data.
We are working on a re-indexing API to allow you to re-index the data after adding a property. This will be available in a future release.
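The limitation can be reproduced with a toy model in which the null-state index is only written when an object is imported. This is a deliberately simplified sketch, not Weaviate's implementation; the `ToyCollection` class and its methods are hypothetical.

```python
from collections import defaultdict

class ToyCollection:
    """Toy collection whose null-state index is only updated at import time."""

    def __init__(self, properties):
        self.properties = list(properties)
        self.objects = {}
        # property name -> {True: IDs where it is null, False: IDs where it is set}
        self.null_index = defaultdict(lambda: {True: set(), False: set()})

    def add_object(self, obj_id, data):
        self.objects[obj_id] = data
        # Index only the properties that exist in the schema right now.
        for prop in self.properties:
            self.null_index[prop][data.get(prop) is None].add(obj_id)

    def add_property(self, prop):
        # The schema changes, but already-imported objects are not re-indexed.
        self.properties.append(prop)

    def filter_is_null(self, prop):
        return sorted(self.null_index[prop][True])

articles = ToyCollection(["title"])
articles.add_object(1, {"title": "Imported before the schema change"})
articles.add_property("summary")
articles.add_object(2, {"title": "Imported after the schema change"})

# Both objects lack a summary, but only object 2 was indexed for it.
print(articles.filter_is_null("summary"))  # [2]
```

Object 1 is missing from the result even though its summary is also null, which is the kind of unexpected query behavior described above.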