Tutorial: Creating Film Database Embeddings with Python¶
This tutorial demonstrates how to create film database embeddings by calling various API endpoints for creating a table, creating a vector index, retrieving film data from a joined view, generating bulk embeddings, and then performing a bulk insert into the film embeddings table. In addition, it shows how to view the API documentation using ReDoc.
Overview¶
In this tutorial, you will learn how to:
- Drop an existing `film_embeddings` table (if it exists).
- Create a new `film_embeddings` table with the required columns.
- Create a vector index on the embedding column.
- Retrieve joined film data using a GET request.
- Batch process the films (using a configurable batch size) by generating multiple embeddings in bulk from an external embeddings API.
- Perform a bulk insert of the generated film embeddings into the `film_embeddings` table.
The Python code uses the `requests` library for all HTTP requests.
Database API Documentation¶
The backend's full API documentation is rendered with ReDoc; you can browse it by opening the `/docs` endpoint of the backend in your browser.
Python Example Code¶
Below is the complete Python script that performs all the steps described. Each section of the code is commented and explained.
```python
#!/usr/bin/env python3
import requests
import json

# --------------------------------------------------------------------
# Set static variables
# --------------------------------------------------------------------
ACCESS_KEY = "YOUR STORAGE ACCESS KEY"
BACKEND_URL = "https://api.db.llmosaic.ai"

# Embeddings API settings (adjust if needed)
EMBEDDING_API_URL = "https://gpu1.llmosaic.ai/bge-large-en-v1.5/v1/embeddings"
EMBEDDING_MODEL = "BAAI/bge-large-en-v1.5"
EMBEDDING_ACCESS_KEY = "YOUR EMBEDDING ACCESS KEY"
EMBEDDING_API_AUTH = f"Bearer {EMBEDDING_ACCESS_KEY}"

# --------------------------------------------------------------------
# Set the number of document texts per embeddings API request and bulk insert
# --------------------------------------------------------------------
EMBEDDING_ROW_COUNT = 32

# --------------------------------------------------------------------
# Headers for backend and embedding API requests
# --------------------------------------------------------------------
headers_backend = {
    "Authorization": f"Bearer {ACCESS_KEY}",
    "Content-Type": "application/json"
}

headers_embedding = {
    "accept": "application/json",
    "Content-Type": "application/json",
    "Authorization": EMBEDDING_API_AUTH
}

# --------------------------------------------------------------------
# Step 0: Drop the "film_embeddings" table if it exists
# --------------------------------------------------------------------
def drop_table():
    print("Dropping table 'film_embeddings' if it exists...")
    payload = {
        "table_name": "film_embeddings",
        "if_exists": True
    }
    response = requests.post(f"{BACKEND_URL}/drop-table", headers=headers_backend, json=payload)
    response.raise_for_status()
    print("Table 'film_embeddings' dropped (if it existed).")

# --------------------------------------------------------------------
# Step 1: Create the "film_embeddings" table
# --------------------------------------------------------------------
def create_table():
    print("Creating table 'film_embeddings'...")
    payload = {
        "table_name": "film_embeddings",
        "columns": [
            {"name": "film_id", "type": "integer", "constraints": "PRIMARY KEY"},
            {"name": "document_text", "type": "text"},
            {"name": "embedding", "type": "vector(1024)"}
        ],
        "not_exists": True
    }
    response = requests.post(f"{BACKEND_URL}/create-table", headers=headers_backend, json=payload)
    response.raise_for_status()
    print("Table 'film_embeddings' created.")

# --------------------------------------------------------------------
# Step 2: Create the vector index on "film_embeddings"
# --------------------------------------------------------------------
def create_vector_index():
    print("Creating vector index 'film_embeddings_vector_index' on table 'film_embeddings'...")
    payload = {
        "table_name": "film_embeddings",
        "index_name": "film_embeddings_vector_index",
        "vector_column": "embedding",
        "index_type": "hnsw",
        "distance_operator": "vector_cosine_ops",
        "not_exists": True
    }
    response = requests.post(f"{BACKEND_URL}/create-vector-index", headers=headers_backend, json=payload)
    response.raise_for_status()
    print("Vector index 'film_embeddings_vector_index' created.")

# --------------------------------------------------------------------
# Step 3: Retrieve film data using a join query
# --------------------------------------------------------------------
def get_film_data():
    print("Querying joined film data from view 'film_list'...")
    params = {"select": "fid,title,description,category,price,actors"}
    response = requests.get(f"{BACKEND_URL}/film_list", headers={"Authorization": f"Bearer {ACCESS_KEY}"}, params=params)
    response.raise_for_status()
    film_data = response.json()
    print("Film data retrieved:")
    print(json.dumps(film_data, indent=2))
    return film_data

# --------------------------------------------------------------------
# Step 4: Batch process film data and perform bulk embeddings & insertions
# --------------------------------------------------------------------
def process_film_batches(film_data):
    num_films = len(film_data)
    print(f"Processing {num_films} films to generate embeddings and insert into film_embeddings...")
    film_ids_batch = []
    doc_texts_batch = []
    batch_count = 0
    for i, film in enumerate(film_data):
        film_id = film.get("fid")
        title = film.get("title")
        description = film.get("description")
        category = film.get("category")
        price = film.get("price")
        actors = film.get("actors")
        # Construct document text from the retrieved fields.
        document_text = (
            f"Film: {title}. Description: {description}. "
            f"Category: {category}. Price: {price}. Actors: {actors}."
        )
        print(f"Queued film_id {film_id} for batch processing.")
        film_ids_batch.append(film_id)
        doc_texts_batch.append(document_text)
        batch_count += 1
        # If the batch is full or this is the last film, process the batch.
        if batch_count == EMBEDDING_ROW_COUNT or i == num_films - 1:
            print(f"Processing batch of {batch_count} films...")
            # Bulk generate embeddings by sending the batch of document texts in one request.
            payload_embedding = {
                "model": EMBEDDING_MODEL,
                "input": doc_texts_batch
            }
            emb_response = requests.post(EMBEDDING_API_URL, headers=headers_embedding, json=payload_embedding)
            emb_response.raise_for_status()
            emb_data = emb_response.json()
            # Assume each item in emb_data["data"] has an "embedding" field.
            embeddings = [item.get("embedding") for item in emb_data.get("data", [])]
            # Build bulk insert payload: list of objects, each with film_id, document_text, and embedding.
            bulk_payload = []
            for j in range(batch_count):
                row_obj = {
                    "film_id": int(film_ids_batch[j]),
                    "document_text": doc_texts_batch[j],
                    "embedding": embeddings[j]
                }
                bulk_payload.append(row_obj)
            print(f"Bulk inserting batch of {batch_count} rows into film_embeddings...")
            insert_response = requests.post(f"{BACKEND_URL}/film_embeddings", headers=headers_backend, json=bulk_payload)
            insert_response.raise_for_status()
            print(json.dumps(insert_response.json(), indent=2))
            print("")
            # Reset batch arrays and counter for the next batch.
            batch_count = 0
            film_ids_batch = []
            doc_texts_batch = []

# --------------------------------------------------------------------
# Main Function
# --------------------------------------------------------------------
def main():
    try:
        drop_table()
        create_table()
        create_vector_index()
        film_data = get_film_data()
        process_film_batches(film_data)
        print("All film documents processed and inserted into film_embeddings.")
    except Exception as e:
        print(f"Error: {e}")

if __name__ == "__main__":
    main()
```
Explanation of the Python Code¶
Step 0: Drop the Table (if it exists)¶
- Function: `drop_table()`
- Sends a POST request to the `/drop-table` endpoint with a payload indicating that the table should be dropped if it exists.

Key code snippet:

```python
payload = {"table_name": "film_embeddings", "if_exists": True}
response = requests.post(f"{BACKEND_URL}/drop-table", headers=headers_backend, json=payload)
```
Step 1: Create the Table¶
- Function: `create_table()`
- Sends a POST request to the `/create-table` endpoint to create the `film_embeddings` table with three columns: `film_id`, `document_text`, and `embedding`.

Key code snippet:

```python
payload = {
    "table_name": "film_embeddings",
    "columns": [
        {"name": "film_id", "type": "integer", "constraints": "PRIMARY KEY"},
        {"name": "document_text", "type": "text"},
        {"name": "embedding", "type": "vector(1024)"}
    ],
    "not_exists": True
}
```
Step 2: Create the Vector Index¶
- Function: `create_vector_index()`
- Sends a POST request to the `/create-vector-index` endpoint. The payload specifies the index name, the target table, and the details required to create a vector index.

Key code snippet:

```python
payload = {
    "table_name": "film_embeddings",
    "index_name": "film_embeddings_vector_index",
    "vector_column": "embedding",
    "index_type": "hnsw",
    "distance_operator": "vector_cosine_ops",
    "not_exists": True
}
```
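The index uses `vector_cosine_ops`, so nearest-neighbor queries over the `embedding` column rank rows by cosine distance (1 minus cosine similarity). As a quick illustration of the metric itself, not part of the tutorial script, cosine similarity between two embedding vectors can be computed in pure Python:

```python
import math

def cosine_similarity(a, b):
    # Dot product of the two vectors divided by the product of their norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Identical vectors score 1.0; orthogonal vectors score 0.0.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

In production the database performs this ranking via the HNSW index; the snippet is only meant to show what the `vector_cosine_ops` operator class measures.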
Step 3: Retrieve Joined Film Data¶
- Function: `get_film_data()`
- Sends a GET request to the `/film_list` endpoint with query parameters that request specific film fields (`fid`, `title`, `description`, `category`, `price`, `actors`). The response is parsed and returned as JSON.

Key code snippet:

```python
params = {"select": "fid,title,description,category,price,actors"}
response = requests.get(f"{BACKEND_URL}/film_list", headers={"Authorization": f"Bearer {ACCESS_KEY}"}, params=params)
film_data = response.json()
```
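Each record returned by the view is later flattened into a single document string before embedding (see Step 4). Using one hypothetical record (the field values below are illustrative samples, not actual API output), the construction looks like this:

```python
# Sample record shaped like one item of the /film_list response (values are made up).
film = {
    "fid": 1,
    "title": "Academy Dinosaur",
    "description": "An epic drama of a feminist and a mad scientist",
    "category": "Documentary",
    "price": 0.99,
    "actors": "Penelope Guiness",
}

# Same f-string construction used in process_film_batches().
document_text = (
    f"Film: {film['title']}. Description: {film['description']}. "
    f"Category: {film['category']}. Price: {film['price']}. Actors: {film['actors']}."
)
print(document_text)
```

Concatenating the fields into one sentence-like string gives the embedding model a single coherent document per film.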
Step 4: Batch Process Film Data and Insert in Bulk¶
- Function: `process_film_batches(film_data)`
- Iterates over the film data, batching films in groups of `EMBEDDING_ROW_COUNT` (32). For each film it constructs a document text string and, when the batch is full (or the end of the list is reached), sends a single bulk POST request to the embeddings API.
- The returned embeddings are paired with the film IDs and document texts to build a bulk insert payload, which is then sent as a single POST request to `/film_embeddings`.

Key code snippet for the bulk embedding request:

```python
payload_embedding = {
    "model": EMBEDDING_MODEL,
    "input": doc_texts_batch
}
emb_response = requests.post(EMBEDDING_API_URL, headers=headers_embedding, json=payload_embedding)
```

And for the bulk insert:

```python
bulk_payload = []
for j in range(batch_count):
    row_obj = {
        "film_id": int(film_ids_batch[j]),
        "document_text": doc_texts_batch[j],
        "embedding": embeddings[j]
    }
    bulk_payload.append(row_obj)
insert_response = requests.post(f"{BACKEND_URL}/film_embeddings", headers=headers_backend, json=bulk_payload)
```
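The counter-based batching in the loop can also be expressed as a small standalone helper that splits any list into fixed-size chunks. This is a sketch of the equivalent batching logic, not code from the tutorial script:

```python
def chunked(items, size):
    """Yield consecutive slices of at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# With EMBEDDING_ROW_COUNT = 32, 70 films produce batches of 32, 32, and 6.
batches = list(chunked(list(range(70)), 32))
print([len(b) for b in batches])  # [32, 32, 6]
```

Either way, the last batch is usually smaller than `EMBEDDING_ROW_COUNT`, which is why the loop also flushes on the final film.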
Main Function¶
- Function: `main()`
- Orchestrates the entire workflow by calling the helper functions in sequence: drop the table, create the table and index, retrieve the film data, and process the batches.
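Because the script talks to two remote services, any transient network failure surfaces as an exception in `main()` and aborts the run. One way to make individual steps more resilient is a small generic retry wrapper; this is a sketch (the attempt count and delay are arbitrary assumptions, not part of the original tutorial):

```python
import time

def with_retries(func, attempts=3, delay=1.0):
    """Call func(); on failure, wait `delay` seconds and retry, up to `attempts` tries."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as exc:
            if attempt == attempts:
                raise  # out of retries: re-raise the last error
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay}s...")
            time.sleep(delay)

# Usage in main(), for example:
#     film_data = with_retries(get_film_data)
```

Wrapping only idempotent steps (reads, `if_exists`/`not_exists` DDL) this way is safest; retrying the bulk insert blindly could duplicate work if the server applied part of a batch.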
Running the Script¶
Install Requirements¶
Ensure you have Python 3 and the `requests` library installed. You can install the `requests` library with `pip install requests`.
Run the Script¶
Save the Python code to a file (e.g., `create_film_embeddings.py`) and execute it with `python3 create_film_embeddings.py`.
View Results¶
The script prints status messages for each step. If everything runs successfully, you will see that the film documents have been processed and inserted in batches into the `film_embeddings` table.
Conclusion¶
In this tutorial, you learned how to build a Python script that:
- Drops an existing table,
- Creates a new table and vector index,
- Retrieves film data from an API view,
- Processes the data in batches to generate embeddings in bulk, and
- Inserts the embeddings into a database in a single bulk request.
Additionally, you now know how to use ReDoc to view the full API documentation by visiting the `/docs` endpoint in your browser. This approach helps you understand both the API endpoints themselves and how to integrate them into your client applications.
Feel free to modify the script as needed to suit your environment or customize the document text and embedding parameters.
Happy coding!