Auto-Embeddings
Generate embeddings for your application automatically
Auto-Embeddings leverages pgvector and OpenAI to generate embeddings for your data in near real-time. In addition, Auto-Embeddings allows you to query your data using natural language or using another object in your database (similarity search). Finally, the embeddings generated by Auto-Embeddings can be used by your own code to perform any analysis or provide any extra functionality you may need.
Generating embeddings
Embeddings generation works in the following way:
- Every sync-period (5 minutes by default) graphite queries your data and searches for new data (or existing data that changed recently).
- It extracts the relevant data for generating the embeddings.
- It requests OpenAI to generate embeddings.
- It stores the embeddings alongside the data.
Querying embeddings
Once the embeddings have been generated you can leverage pgvector to perform any operation you may want. In addition, Auto-Embeddings will automatically provide two new queries: graphiteSearchXXX (search using natural language) and graphiteSimilarXXX (search for objects similar to a given object), where XXX is a name given by you.
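As a sketch of the kind of operation pgvector makes possible, the query below lists the ten movies closest to a reference movie by cosine distance (pgvector's <=> operator). It assumes the public.movies table from the demo further down, with an id column, a name column and an embeddings column; the title 'Alien' is a hypothetical value used only for illustration.

```sql
-- Pick one reference movie; 'Alien' is a hypothetical title.
WITH ref AS (
    SELECT id, embeddings
    FROM public.movies
    WHERE name = 'Alien'
    LIMIT 1
)
-- Rank every other movie by cosine distance to the reference.
SELECT m.id,
       m.name,
       m.embeddings <=> ref.embeddings AS distance
FROM public.movies AS m
CROSS JOIN ref
WHERE m.id <> ref.id                -- skip the reference movie itself
  AND m.embeddings IS NOT NULL      -- skip rows not yet processed
ORDER BY distance
LIMIT 10;
```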
Demo
To demonstrate how to configure Auto-Embeddings we will use an example project where we are storing movies.
Before we start, let’s describe the project. Our project contains a single table called movies with the following columns:
Our goal is to generate embeddings using the data in the columns name, genre and overview.
Preparing your database
Before we can start generating embeddings we need to prepare our database:
- First we are going to need a column to store the embeddings.
- Finally, we are going to need a mechanism to detect when embeddings need to be regenerated.
Embeddings Column
Creating a column to store embeddings is as easy as creating any other column. Just make sure it is of type vector(1536) and that it can be NULL. For our project we can simply go to the SQL tab and create a migration with the following content:
For instance:
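A minimal sketch of such a migration, assuming the table lives in the public schema and that the pgvector extension is available under the name vector (it may already be enabled in your project, in which case the first statement does nothing):

```sql
-- Enable pgvector; a no-op if the extension is already installed.
CREATE EXTENSION IF NOT EXISTS vector;

-- Nullable column: rows have no embeddings until the first sync fills them in.
ALTER TABLE public.movies
    ADD COLUMN embeddings vector(1536);
```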
Detecting Changes
On every sync-period graphite will perform a GraphQL query to get all the rows that have outdated embeddings. This means we can build this query in a way that:
- Gets the data we need for the embeddings.
- Retrieves objects with the embeddings column set to NULL.
- Leverages another mechanism to detect which rows need their embeddings regenerated.
In this example we are going to opt for the following mechanism to detect outdated embeddings:
- Add a column outdated (boolean) to indicate whether the row is outdated or not.
- Add a Postgres trigger and function that will set the outdated column to true every time there is a change to our data.
With this in mind we are going to create a migration with the following contents:
For instance:
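One way such a migration might look is sketched below. The function and trigger names are hypothetical, the watched columns are assumed to be name, genre and overview, and EXECUTE FUNCTION assumes PostgreSQL 11 or later:

```sql
-- Flag column; defaulting to true marks existing rows for processing.
ALTER TABLE public.movies
    ADD COLUMN outdated boolean NOT NULL DEFAULT true;

-- Trigger function (hypothetical name): mark the row as outdated.
CREATE OR REPLACE FUNCTION public.set_movies_outdated()
RETURNS trigger AS $$
BEGIN
    NEW.outdated := true;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- Fire on inserts and on updates that target the embedded columns only.
CREATE TRIGGER movies_set_outdated
BEFORE INSERT OR UPDATE OF name, genre, overview ON public.movies
FOR EACH ROW
EXECUTE FUNCTION public.set_movies_outdated();
```

Restricting the trigger to the three source columns is a design choice in this sketch: an update that only writes the embeddings and resets outdated does not fire the trigger, so the row is not flagged again straight away.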
Now, graphite’s query will be (more on this later):
Alternatively, you could set outdated directly from your application or use some completely different mechanism (e.g. a computed field). The important bit is that graphite needs to be able to make a GraphQL query and get the relevant rows and data.
Configuring Auto-Embeddings
Now that we have prepared our database we can proceed to configure Auto-Embeddings. You will need the following data:
- A unique name. We are going to use movies for this particular example but it can be anything. This will determine the names of the GraphQL queries graphiteSearchXXX and graphiteSimilarXXX.
- The location of the embeddings column: schema, table and column names. In this example these are public, movies and embeddings respectively.
- A GraphQL query to retrieve the outdated rows and their new data (the query we worked on in the previous section).
- A GraphQL mutation that takes the id of the object and the embeddings, and updates the relevant object. For instance, in this particular example the following mutation would suffice:
NOTE: It is important that the query returns the id of the object and that the mutation accepts it as otherwise graphite won’t know which object to update.
Now that we have all the data we need, adding the configuration is as simple as running the following GraphQL mutation:
Aftermath
Embeddings Generation
After executing the mutation above two things will happen. First, if we look at our logs we will start seeing entries like this on the next sync-period:
The logs indicate that graphite has started to generate embeddings for the movies. We can track the progress by counting movies with the embeddings column set to NULL:
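In SQL, a minimal sketch of that count (assuming the public.movies table and embeddings column from this example) could be:

```sql
-- Number of movies still waiting for their embeddings.
SELECT count(*) AS pending
FROM public.movies
WHERE embeddings IS NULL;
```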
Until eventually it reaches 0.
Natural Language and Similarity Search
The second thing that will happen is that the queries graphiteSearchMovies, graphiteSearchMoviesAggregate, graphiteSimilarMovies and graphiteSimilarMoviesAggregate will be created. These queries work similarly to the standard movies and moviesAggregate queries provided by hasura and respect the same permissions, but they also allow you to query movies using natural language or other movies for comparison. For instance:
Both queries accept a third argument, maxDistance. This argument allows you to filter out responses that are too far from the query, which is useful for ensuring that very specific queries do not return entirely unrelated responses. The argument is a float between 0.0 (an exact match) and 1.0 (completely unrelated), with a default of 1.0, ensuring the best matches are always returned even if they are unrelated to the query.