Auto-Embeddings leverages pgvector and OpenAI to generate embeddings for your data in near real time. In addition, Auto-Embeddings lets you query your data using natural language or using another object in your database (similarity search). Finally, the embeddings it generates can be used by your own code to perform any analysis or provide any extra functionality you may need.

Generating embeddings

Embeddings generation works in the following way:

  1. Every sync-period (5 minutes by default), graphite queries your data looking for new rows or existing rows that changed recently.
  2. It extracts the relevant data for generating the embeddings.
  3. It requests OpenAI to generate the embeddings.
  4. It stores the embeddings alongside the data.
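
The four steps above can be sketched as a single sync pass. This is a minimal illustration, not graphite's actual implementation: the names sync_once, build_input and fake_embed are hypothetical, the way the columns are joined into one text is an assumption, and fake_embed is a stand-in for the OpenAI request so the example runs offline.

```python
from typing import Callable, Dict, List, Sequence

Row = Dict[str, object]


def build_input(row: Row, fields: Sequence[str]) -> str:
    """Join the selected columns into the text to embed (format is an assumption)."""
    return "\n".join(f"{name}: {row[name]}" for name in fields)


def sync_once(
    rows: List[Row],
    fields: Sequence[str],
    embed: Callable[[str], List[float]],
) -> int:
    """Run one sync pass over `rows` and return how many rows were embedded."""
    updated = 0
    for row in rows:
        # Step 1: only new rows (no embeddings) or rows flagged as changed.
        if row.get("embeddings") is not None and not row.get("outdated"):
            continue
        # Step 2: extract the relevant data.
        text = build_input(row, fields)
        # Step 3: ask the embedding provider for a vector.
        vector = embed(text)
        # Step 4: store the vector alongside the data.
        row["embeddings"] = vector
        row["outdated"] = False
        updated += 1
    return updated


def fake_embed(text: str) -> List[float]:
    """Hypothetical stand-in for the OpenAI call; NOT a real embedding."""
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


movies = [
    {"id": 1, "name": "Alien", "genre": "horror",
     "overview": "A crew meets a creature.", "embeddings": None, "outdated": True},
    {"id": 2, "name": "Up", "genre": "animation",
     "overview": "A house flies.", "embeddings": [0.1, 0.2], "outdated": False},
]
fields = ["name", "genre", "overview"]
first_pass = sync_once(movies, fields, fake_embed)
second_pass = sync_once(movies, fields, fake_embed)
print(first_pass, second_pass)  # 1 0
```

The second pass does nothing because every row already has up-to-date embeddings, which is why only new or changed rows cost a request on each sync-period.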

Querying embeddings

Once the embeddings have been generated, you can use pgvector to perform any operation you want on them. In addition, Auto-Embeddings automatically provides two new queries: graphiteSearchXXX (search using natural language) and graphiteSimilarXXX (search for objects similar to a given object), where XXX is a name given by you.
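
To give an idea of what "any operation" means in practice, the sketch below reimplements in plain Python the three distance measures behind pgvector's `<->`, `<#>` and `<=>` operators, and uses one of them to rank objects by similarity. The two-dimensional vectors and movie titles are made up for illustration (real embeddings here have 1536 dimensions), and which measure graphite uses internally is not stated in this guide.

```python
import math
from typing import Sequence


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance, which is what pgvector's `<->` operator computes."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def negative_inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Negative inner product, which is what pgvector's `<#>` operator computes."""
    return -sum(x * y for x, y in zip(a, b))


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity, which is what pgvector's `<=>` operator computes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


# Made-up vectors: rank movies by how close they are to a reference vector.
reference = [1.0, 0.0]
movies = {"Alien": [0.9, 0.1], "Up": [0.1, 0.9], "Aliens": [0.8, 0.3]}
ranked = sorted(movies, key=lambda title: cosine_distance(reference, movies[title]))
print(ranked)  # ['Alien', 'Aliens', 'Up']
```

A similarity search is this same ranking with the reference vector taken from another row, while a natural-language search first turns the query text into a vector.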

Demo

To demonstrate how to configure Auto-Embeddings we will use an example project where we are storing movies.

Before we start, let’s describe the project. It contains a single table called movies; the columns relevant to this guide are id (a uuid), name, genre and overview.

Our goal is to generate embeddings from the data in the columns name, genre and overview.

Preparing your database

Before we can start generating embeddings we need to prepare our database:

  1. First, we need a column to store the embeddings.
  2. Second, we need a mechanism to detect when embeddings need to be regenerated.

Embeddings Column

Creating a column to store embeddings is as easy as creating any other column. Just make sure it is of type vector(1536) and it can be NULL. For our project we can simply go to the SQL tab and create a migration with the following content:

-- in this example we are using "embeddings" as the name for our column
-- but you can choose anything you want
ALTER TABLE public.movies ADD COLUMN embeddings vector(1536) NULL;

Detecting Changes

On every sync-period graphite will perform a graphql query to get all the rows that have outdated embeddings. This means we can build this query in a way that:

  1. Gets the data we need for the embeddings.
  2. Retrieves objects with the embeddings column set to NULL.
  3. Leverages another mechanism to detect which rows need their embeddings regenerated.

In this example we are going to opt for the following mechanism to detect outdated embeddings:

  1. Add a column outdated (boolean) to indicate whether the row is outdated or not.
  2. Add a postgres trigger and function that will set the outdated column to true every time there is a change to our data.

With this in mind we are going to create a migration with the following contents:

-- Add the "outdated" column of type "boolean" with a default value of true
ALTER TABLE "movies" ADD COLUMN "outdated" boolean DEFAULT true;

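-- Create a trigger function that sets "outdated" to true if the columns
-- "name", "genre" or "overview" are updated.
-- IS DISTINCT FROM is used instead of <> so that changes to or from NULL
-- are detected as well.
CREATE OR REPLACE FUNCTION set_outdated_trigger()
RETURNS TRIGGER AS $$
BEGIN
    IF NEW.name IS DISTINCT FROM OLD.name
       OR NEW.genre IS DISTINCT FROM OLD.genre
       OR NEW.overview IS DISTINCT FROM OLD.overview THEN
        NEW.outdated := true;
    -- Data unchanged and embeddings present: the row is up to date
    ELSEIF NEW.embeddings IS NOT NULL THEN
        NEW.outdated := false;
    END IF;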
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- Attach the trigger to the table
CREATE TRIGGER set_outdated
BEFORE UPDATE ON "movies"
FOR EACH ROW
EXECUTE FUNCTION set_outdated_trigger();
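
To make the behaviour of the trigger easier to follow, here is a small Python model of its decision logic. It is only an illustration: the function before_update and the sample rows are hypothetical, and it assumes the three watched columns are never NULL.

```python
from typing import Dict

Row = Dict[str, object]
WATCHED = ("name", "genre", "overview")


def before_update(old: Row, new: Row) -> Row:
    """Return the row as it would be stored after the BEFORE UPDATE trigger runs."""
    result = dict(new)
    if any(new[column] != old[column] for column in WATCHED):
        # The source data changed, so the stored embeddings are stale.
        result["outdated"] = True
    elif new["embeddings"] is not None:
        # Data unchanged and embeddings present: the row is up to date.
        result["outdated"] = False
    # Otherwise "outdated" keeps whatever value the update carried.
    return result


stored = {"name": "Alien", "genre": "horror",
          "overview": "A crew meets a creature.",
          "embeddings": None, "outdated": True}

# Embeddings are written for the row: the flag is cleared.
embedded = before_update(stored, {**stored, "embeddings": [0.1, 0.2]})
# Someone edits the overview afterwards: the flag is raised again.
edited = before_update(embedded, {**embedded, "overview": "A crew fights a creature."})
print(embedded["outdated"], edited["outdated"])  # False True
```

The second branch is what resets the flag when the embeddings are written back, so no extra step is needed to mark a row as processed.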

Now, graphite’s query will be (more on this later):

query GetOutdatedMovies {
  movies(where: {
    _or: [
      {embeddings: {_is_null: true}}, # new rows without embeddings
      {outdated: {_eq: true}},        # existing rows with changed data
    ]
  }) {
    id                                # id column is mandatory
    name
    genre
    overview
  }
}

This mechanism has the advantage of being simple and fully handled by postgres, so you don’t need to worry about it. In addition, it should be flexible enough to cover most cases. For very complex use cases you could skip the postgres function and trigger and update the outdated column directly from your application, or you could use a completely different mechanism (e.g. a computed field). The important bit is that graphite needs to be able to make a graphql query and get the relevant rows and data.

Configuring Auto-Embeddings

Now that we have prepared our database we can proceed to configure Auto-Embeddings. You will need the following data:

  1. A unique name. We are going to use movies for this particular example but it can be anything. This will determine the name of the GraphQL queries graphiteSearchXXX and graphiteSimilarXXX.
  2. The location of the embeddings column: schema, table and column names (in this example public, movies and embeddings respectively).
  3. A GraphQL query to retrieve the outdated rows and their new data (the query we worked on in the previous section).
  4. A GraphQL mutation that takes the id of the object and the embeddings and updates the relevant object. For instance, in this particular example the following mutation would suffice:
mutation UpdateEmbeddingsMovie($id: uuid!, $embeddings: vector) {
  updateMovie(pk_columns: {id: $id}, _set: {embeddings: $embeddings}) {
    __typename
  }
}

NOTE: It is important that the query returns the id of the object and that the mutation accepts it; otherwise graphite won’t know which object to update.

Now that we have all the data we need, adding the configuration is as simple as running the following graphql mutation:

Aftermath

Embeddings Generation

After executing the mutation above, two things will happen. First, if we look at our logs, we will start seeing entries like this on the next sync-period:

The logs indicate that graphite has started to generate embeddings for the movies. We can track the progress by counting the movies whose embeddings column is still NULL, until the count eventually reaches 0.
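
One way to automate this check is to poll the count until it reaches 0. The sketch below is only an illustration: it relies on the standard moviesAggregate query with the same _is_null filter used earlier, while run_query is a hypothetical helper that sends a GraphQL query to your endpoint and returns the "data" object of the response. Here it is replaced by a fake so the example runs offline.

```python
import time
from typing import Callable, Dict, List

# Counts the movies that still have no embeddings.
PENDING_QUERY = """
query PendingMovies {
  moviesAggregate(where: {embeddings: {_is_null: true}}) {
    aggregate {
      count
    }
  }
}
"""


def pending_count(run_query: Callable[[str], Dict]) -> int:
    """Return how many movies are still waiting for embeddings."""
    data = run_query(PENDING_QUERY)
    return data["moviesAggregate"]["aggregate"]["count"]


def wait_until_done(
    run_query: Callable[[str], Dict],
    poll_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[int]:
    """Poll until no movie is pending and return every count observed."""
    history: List[int] = []
    while True:
        remaining = pending_count(run_query)
        history.append(remaining)
        if remaining == 0:
            return history
        sleep(poll_seconds)


# Fake endpoint (assumption for the demo): reports 3, then 1, then 0 pending rows.
fake_counts = iter([3, 1, 0])


def fake_run_query(query: str) -> Dict:
    return {"moviesAggregate": {"aggregate": {"count": next(fake_counts)}}}


history = wait_until_done(fake_run_query, sleep=lambda seconds: None)
print(history)  # [3, 1, 0]
```

In a real project run_query would perform an authenticated HTTP request, and the polling interval does not need to be shorter than the sync-period.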

The second thing that will happen is that the queries graphiteSearchMovies, graphiteSearchMoviesAggregate, graphiteSimilarMovies and graphiteSimilarMoviesAggregate will be created. These queries work similarly to the standard movies and moviesAggregate queries provided by hasura and respect the same permissions, but they also allow you to query movies using natural language or using other movies for comparison.

Both queries accept a third argument, maxDistance. This argument allows you to filter out responses that are too far from the query, which is useful for ensuring that very specific queries do not return entirely unrelated responses. The argument is a float between 0.0 (an exact match) and 1.0 (completely unrelated). It defaults to 1.0, which ensures the best matches are always returned even if they are unrelated to the query.
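
The effect of maxDistance can be pictured with a small filter over already-scored results. This is a hedged sketch rather than graphite's implementation: the function filter_by_max_distance is hypothetical, the titles and distances are made up, and treating the threshold as inclusive is an assumption.

```python
from typing import List, Tuple

Scored = Tuple[str, float]  # (title, distance to the query)


def filter_by_max_distance(scored: List[Scored], max_distance: float = 1.0) -> List[Scored]:
    """Keep results within `max_distance` of the query, closest first."""
    if not 0.0 <= max_distance <= 1.0:
        raise ValueError("max_distance must be between 0.0 and 1.0")
    kept = [(title, distance) for title, distance in scored if distance <= max_distance]
    return sorted(kept, key=lambda pair: pair[1])


# Made-up distances for illustration only.
scored = [("Up", 0.55), ("Alien", 0.12), ("Cars", 0.91)]

everything = filter_by_max_distance(scored)   # default 1.0 keeps all results
close_only = filter_by_max_distance(scored, 0.6)
print([title for title, _ in everything])  # ['Alien', 'Up', 'Cars']
print([title for title, _ in close_only])  # ['Alien', 'Up']
```

Lowering the threshold trades recall for relevance: a very specific query may then return fewer results, or none, instead of padding the answer with unrelated objects.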