Wikipedia with custom vectors
This tutorial will show you how to import a large dataset (25k articles from Wikipedia) that already includes vectors (embeddings generated by OpenAI). We will:
- download and unzip a CSV file that contains the Wikipedia articles
- create a Weaviate instance
- create a schema
- parse the file and batch import the records, with Python and JavaScript code
- make sure the data was imported correctly
- run a few queries to demonstrate semantic search capabilities
Prerequisites
If you haven't yet, we recommend going through the Quickstart tutorial first to get the most out of this section.
Before you start this tutorial, make sure to have:
- An OpenAI API key. Even though we already have vector embeddings generated by OpenAI, we'll need an OpenAI key to vectorize search queries, and to recalculate vector embeddings for updated object contents.
- Your preferred Weaviate client library installed.
See how to delete data from previous tutorials (or previous runs of this tutorial).
You can delete any unwanted collection(s), along with the data that they contain.
When you delete a collection, you delete all associated objects!
Be very careful with deletes on a production database and anywhere else that you have important data.
This code deletes a collection and its objects.
If a snippet doesn't work or you have feedback, please open a GitHub issue.
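# Note: this snippet uses the v4 Python client syntax. With the v3 client used in the
# rest of this tutorial, the equivalent call is client.schema.delete_class("Article").

# collection_name can be a string ("Article") or a list of strings (["Article", "Category"])
client.collections.delete(
    collection_name
)  # THIS WILL DELETE THE SPECIFIED COLLECTION(S) AND THEIR OBJECTS

# Note: you can also delete all collections in the Weaviate instance with:
# client.collections.delete_all()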
Download the dataset
We will use this Simple English Wikipedia dataset hosted by OpenAI (~700MB zipped, 1.7GB CSV file) that includes vector embeddings. These are the columns of interest, where content_vector is a vector embedding with 1536 elements (dimensions), generated using OpenAI's text-embedding-ada-002 model:
id | url | title | text | content_vector |
---|---|---|---|---|
1 | https://simple | April | "April is the fourth month of the year..." | [-0.011034, -0.013401, ..., -0.009095] |
If you haven't already, make sure to download the dataset and unzip the file. You should end up with vector_database_wikipedia_articles_embedded.csv in your working directory. The records are mostly (but not strictly) sorted by title.
Download Wikipedia dataset ZIP
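Before importing, it can be worth checking that a row parses the way you expect. The sketch below uses only the standard library and a shortened inline sample instead of the real file. The sample values, the example.org URL, and the `parse_article_row` helper are illustrative assumptions, not part of the dataset. With the real CSV you would pass `expected_dims=1536`.

```python
import ast
import csv
import io

# Long article texts can exceed the csv module's default field size limit.
csv.field_size_limit(10**9)

# Hypothetical, shortened sample in the same column layout as the dataset.
SAMPLE_CSV = '''id,url,title,text,content_vector
1,https://example.org/April,April,"April is the fourth month of the year...","[-0.011034, -0.013401, -0.009095]"
'''


def parse_article_row(row, expected_dims=None):
    """Convert one CSV row (a dict of strings) into properties plus a float vector."""
    # The vector is stored as a string such as "[0.1, 0.2]"; parse it safely.
    vector = [float(x) for x in ast.literal_eval(row["content_vector"])]
    if expected_dims is not None and len(vector) != expected_dims:
        raise ValueError(f"expected {expected_dims} dimensions, got {len(vector)}")
    properties = {"title": row["title"], "content": row["text"], "url": row["url"]}
    return properties, vector


reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
properties, vector = parse_article_row(next(reader), expected_dims=3)
print(properties["title"], len(vector))
```

To run the same check against the downloaded file, replace `io.StringIO(SAMPLE_CSV)` with `open("vector_database_wikipedia_articles_embedded.csv", newline="", encoding="utf-8")`.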
Create a Weaviate instance
We can create a Weaviate instance locally using the embedded option on Linux (transparent and fastest), Docker on any OS (fastest import and search), or in the cloud using the Weaviate Cloud (easiest setup, but importing may be slower due to the network speed). Each option is explained on its Installation page.
If using the Docker option, make sure to select "With Modules" (instead of standalone), and the text2vec-openai module when using the Docker configurator, at the "Vectorizer & Retriever Text Module" step. At the "OpenAI Requires an API Key" step, you can choose to "provide the key with each request", as we'll do in the next section.
Connect to the instance and OpenAI
Add the OpenAI API key to the client so you can use the OpenAI vectorizer API when you send queries to Weaviate.
The API key can be provided to Weaviate as an environment variable, or in the HTTP header with every request. This example adds the key to the client. The client sends the key with every request as a part of the HTTP request header.
import weaviate

# Instantiate the client with the auth config
client = weaviate.Client(
    url="https://WEAVIATE_INSTANCE_URL",  # Replace with your Weaviate endpoint
    auth_client_secret=weaviate.auth.AuthApiKey(api_key="YOUR-WEAVIATE-API-KEY"),  # Replace with your Weaviate instance API key
    additional_headers={
        "X-OpenAI-Api-Key": "YOUR-OPENAI-API-KEY",
    },
)
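Rather than hard-coding the key, you may prefer to read it from the environment. This is a minimal sketch, assuming the key is stored in an environment variable named `OPENAI_API_KEY` (that variable name and the `build_openai_headers` helper are assumptions made for this example):

```python
import os


def build_openai_headers(env_var="OPENAI_API_KEY"):
    """Return the extra HTTP headers that carry the OpenAI key to Weaviate."""
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return {"X-OpenAI-Api-Key": api_key}


# Demo only: fall back to a placeholder so the sketch runs end to end.
os.environ.setdefault("OPENAI_API_KEY", "YOUR-OPENAI-API-KEY")
headers = build_openai_headers()
print(sorted(headers))
```

The resulting dictionary can be passed as the `additional_headers` argument when instantiating the client.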
Create the schema
The schema defines the data structure for objects in a given Weaviate class. We'll create a schema for a Wikipedia Article class mapping the CSV columns, and using the text2vec-openai vectorizer. The schema will have two properties:
- title: article title, not vectorized
- content: article content, corresponding to the text column from the CSV
As of Weaviate 1.18, the text2vec-openai vectorizer uses the same model as the OpenAI dataset, text-embedding-ada-002, by default. To make sure the tutorial keeps working the same way if this default changes (for example, if OpenAI releases a better-performing model and Weaviate switches to it as the default), we'll configure the schema vectorizer explicitly to use the same model:
{
  "moduleConfig": {
    "text2vec-openai": {
      "model": "ada",
      "modelVersion": "002",
      "type": "text"
    }
  }
}
Another detail to be careful about is how exactly we store the content_vector embedding. Weaviate vectorizes entire objects (not properties), and by default it includes the class name in the string serialization of the object it will vectorize. Since OpenAI has provided embeddings only for the text (content) field, we need to make sure Weaviate vectorizes an Article object the same way. That means we need to disable including the class name in the vectorization, so we must set vectorizeClassName: false in the text2vec-openai section of the moduleConfig. Together, these schema settings will look like this:
# client.schema.delete_all()  # ⚠️ uncomment to start from scratch by deleting ALL data

# ===== Create Article class for the schema =====
article_class = {
    "class": "Article",
    "description": "An article from the Simple English Wikipedia data set",
    "vectorizer": "text2vec-openai",
    "moduleConfig": {
        # Match how OpenAI created the embeddings for the `content` (`text`) field
        "text2vec-openai": {
            "model": "ada",
            "modelVersion": "002",
            "type": "text",
            "vectorizeClassName": False
        }
    },
    "properties": [
        {
            "name": "title",
            "description": "The title of the article",
            "dataType": ["text"],
            # Don't vectorize the title
            "moduleConfig": {"text2vec-openai": {"skip": True}}
        },
        {
            "name": "content",
            "description": "The content of the article",
            "dataType": ["text"],
        }
    ]
}

# Add the Article class to the schema
client.schema.create_class(article_class)
print('Created schema')
To quickly check that the schema was created correctly, you can navigate to <weaviate-endpoint>/v1/schema. For example, in the Docker installation scenario, go to http://localhost:8080/v1/schema or run:
curl -s http://localhost:8080/v1/schema | jq
The jq command used after curl is a handy command-line JSON processor. When you simply pipe some JSON through it, jq returns the text pretty-printed and syntax-highlighted.
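The same check can be scripted. This is a rough sketch rather than an official recipe: it assumes the `/v1/schema` endpoint returns a JSON object with a `classes` list, and the helper names (`fetch_schema`, `find_class`, `check_article_class`) are made up for this example. The demo at the bottom runs offline against a hand-written schema fragment, not real server output.

```python
import json
import urllib.request


def fetch_schema(base_url="http://localhost:8080"):
    """Download the schema JSON from a running Weaviate instance."""
    with urllib.request.urlopen(f"{base_url}/v1/schema") as response:
        return json.load(response)


def find_class(schema, class_name):
    """Return the definition of class_name from a schema dict, or None."""
    for cls in schema.get("classes", []):
        if cls.get("class") == class_name:
            return cls
    return None


def check_article_class(cls):
    """Collect human-readable problems with the Article class settings."""
    problems = []
    module = cls.get("moduleConfig", {}).get("text2vec-openai", {})
    if module.get("vectorizeClassName") is not False:
        problems.append("vectorizeClassName should be false")
    names = {prop["name"] for prop in cls.get("properties", [])}
    for required in ("title", "content"):
        if required not in names:
            problems.append(f"missing property: {required}")
    return problems


# Offline demo with a hand-written schema fragment (illustrative only).
sample_schema = {
    "classes": [
        {
            "class": "Article",
            "moduleConfig": {"text2vec-openai": {"vectorizeClassName": False}},
            "properties": [{"name": "title"}, {"name": "content"}],
        }
    ]
}
article = find_class(sample_schema, "Article")
print(check_article_class(article))
```

Against a live instance, you would call `find_class(fetch_schema(), "Article")` instead of using `sample_schema`.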
Import the articles
We're now ready to import the articles. For maximum performance, we'll load the articles into Weaviate via batch import.
# ===== Import data =====
import ast

import pandas as pd

# Settings for displaying the import progress
counter = 0
interval = 100  # print progress every this many records

# Create a pandas dataframe iterator with lazy-loading,
# so we don't load all records in RAM at once.
csv_iterator = pd.read_csv(
    'vector_database_wikipedia_articles_embedded.csv',
    usecols=['id', 'url', 'title', 'text', 'content_vector'],
    chunksize=100,  # number of rows per chunk
    # nrows=350  # optionally limit the number of rows to import
)

# Iterate through the dataframe chunks and add each CSV record to the batch
client.batch.configure(batch_size=100)  # Configure batch
with client.batch as batch:
    for chunk in csv_iterator:
        for index, row in chunk.iterrows():
            properties = {
                "title": row["title"],
                "content": row["text"],
                # url is not declared in the schema above; this relies on
                # Weaviate's auto-schema adding the property on import
                "url": row["url"]
            }

            # Convert the vector from CSV string back to array of floats
            vector = ast.literal_eval(row["content_vector"])

            # Add the object to the batch, and set its vector embedding
            batch.add_data_object(properties, "Article", vector=vector)

            # Calculate and display progress
            counter += 1
            if counter % interval == 0:
                print(f"Imported {counter} articles...")

print(f"Finished importing {counter} articles.")
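The idea behind lazy loading plus batching can be shown without pandas or a running Weaviate instance. The sketch below is a standard-library illustration of the pattern, not the client's actual batching code: `fake_rows` is a hypothetical stand-in for the CSV reader, and the list `sent_batches` stands in for network calls.

```python
from itertools import islice


def batched(records, batch_size):
    """Yield lists of up to batch_size items without materializing all records."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def fake_rows(total):
    """Stand-in for the CSV reader: produces rows lazily, one at a time."""
    for i in range(total):
        yield {"title": f"Article {i}"}


sent_batches = []
imported = 0
for batch in batched(fake_rows(250), batch_size=100):
    sent_batches.append(len(batch))  # a real import would send the batch here
    imported += len(batch)

print(sent_batches, imported)
```

Only one batch is held in memory at a time, which is the property that keeps the import above from loading the whole 1.7GB CSV into RAM.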
Checking the import went correctly
Two quick sanity checks that the import went as expected:
- Get the number of articles
- Get 5 articles
To run both checks:
- Open the Weaviate Query app
- Connect to your Weaviate endpoint, either http://localhost:8080 or https://WEAVIATE_INSTANCE_URL. (Replace WEAVIATE_INSTANCE_URL with your instance URL.)
- Run this GraphQL query:
query {
  Aggregate { Article { meta { count } } }
  Get {
    Article(limit: 5) {
      title
      url
    }
  }
}
You should see the Aggregate.Article.meta.count field equal to the number of articles you've imported (e.g. 25,000), as well as five articles with their title and url fields.
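If you would rather run the check from a script, the query can be sent over HTTP. This is a hedged sketch: it assumes the GraphQL endpoint lives at `/v1/graphql`, that it accepts a JSON body with a `query` key, and that the reply has the shape shown in `sample_reply` (a hand-written example with illustrative values, not captured output). `run_graphql` and `summarize` are names invented for this example.

```python
import json
import urllib.request

SANITY_QUERY = """
{
  Aggregate { Article { meta { count } } }
  Get { Article(limit: 5) { title url } }
}
"""


def run_graphql(query, base_url="http://localhost:8080", headers=None):
    """POST a GraphQL query to the instance and return the decoded JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/v1/graphql",
        data=json.dumps({"query": query}).encode("utf-8"),
        headers={"Content-Type": "application/json", **(headers or {})},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def summarize(reply):
    """Extract the article count and the sample titles from a reply."""
    data = reply["data"]
    count = data["Aggregate"]["Article"][0]["meta"]["count"]
    titles = [article["title"] for article in data["Get"]["Article"]]
    return count, titles


# Offline demo with a hand-written reply (assumed shape, illustrative values).
sample_reply = {
    "data": {
        "Aggregate": {"Article": [{"meta": {"count": 25000}}]},
        "Get": {"Article": [{"title": "April", "url": "u1"}, {"title": "August", "url": "u2"}]},
    }
}
print(summarize(sample_reply))
```

With a local Docker instance running, `summarize(run_graphql(SANITY_QUERY))` would perform the real check. A cloud instance would likely also need an authorization header passed through `headers`.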
Queries
Now that we have the articles imported, let's run some queries!
nearText
The nearText filter lets us search for objects close (in vector space) to the vector embedding of one or more concepts. For example, the vector for the query "modern art in Europe" would be close to the vector for the article Documenta, which describes "one of the most important exhibitions of modern art in the world... [taking] place in Kassel, Germany".
{
  Get {
    Article(
      nearText: {concepts: ["modern art in Europe"]},
      limit: 1
    ) {
      title
      content
    }
  }
}
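To make "close in vector space" concrete, here is a toy illustration using cosine similarity, one common way to compare embeddings. The three-dimensional vectors are invented for the example (real ada-002 embeddings have 1536 dimensions), and the code is a conceptual sketch, not how Weaviate's index works internally.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors: the query points in nearly the same direction as "Documenta".
query = [0.9, 0.1, 0.0]
articles = {
    "Documenta": [0.8, 0.2, 0.1],
    "April": [0.0, 0.1, 0.9],
}

# Rank titles from most to least similar to the query vector.
ranked = sorted(
    articles,
    key=lambda title: cosine_similarity(query, articles[title]),
    reverse=True,
)
print(ranked)
```

A nearText search follows the same principle: the query text is embedded, and the objects whose vectors are closest to it are returned first.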
hybrid
While nearText uses dense vectors to find objects similar in meaning to the search query, it does not perform very well on keyword searches. For example, a nearText search for "jackfruit" in this Simple English Wikipedia dataset will find "cherry tomato" as the top result. For these (and indeed, most) situations, we can obtain better search results by using the hybrid filter, which combines dense vector search with keyword search:
{
  Get {
    Article (
      hybrid: {
        query: "jackfruit"
        alpha: 0.5  # default 0.75
      }
      limit: 3
    ) {
      title
      content
      _additional {score}
    }
  }
}
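The role of alpha can be illustrated with a simplified weighted fusion of two score lists. This is a conceptual sketch only: the scores are invented, and the min-max normalization plus weighted sum shown here is an assumption made for illustration, not a description of Weaviate's actual fusion algorithm. It does reflect the general idea that alpha = 1 favors the vector search and alpha = 0 favors the keyword search.

```python
def normalize(scores):
    """Min-max scale a {title: score} mapping into the range [0, 1]."""
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {title: 1.0 for title in scores}
    return {title: (s - low) / (high - low) for title, s in scores.items()}


def hybrid_rank(vector_scores, keyword_scores, alpha):
    """Rank titles by alpha * vector score + (1 - alpha) * keyword score."""
    vec = normalize(vector_scores)
    kw = normalize(keyword_scores)
    titles = set(vec) | set(kw)
    fused = {
        title: alpha * vec.get(title, 0.0) + (1 - alpha) * kw.get(title, 0.0)
        for title in titles
    }
    return sorted(fused, key=fused.get, reverse=True)


# Invented scores: the vector search slightly prefers "Cherry tomato",
# while the keyword search only matches titles containing the query term.
vector_scores = {"Cherry tomato": 0.83, "Jackfruit": 0.80, "Fruit": 0.70}
keyword_scores = {"Jackfruit": 7.0, "Fruit": 1.5}

print(hybrid_rank(vector_scores, keyword_scores, alpha=1.0))
print(hybrid_rank(vector_scores, keyword_scores, alpha=0.5))
```

In this toy data, pure vector ranking puts "Cherry tomato" first, while giving the keyword score equal weight moves "Jackfruit" to the top.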
Recap
In this tutorial, we've learned:
- how to efficiently import large datasets using Weaviate batching and CSV lazy loading with pandas/csv-parser
- how to import existing vectors ("Bring Your Own Vectors")
- how to quickly check that all records were imported
- how to use nearText and hybrid searches
Questions and feedback
If you have any questions or feedback, let us know in the user forum.