Vector search, vector databases, vector this vector that. One can hardly swing a dead cat these days without hitting someone talking about vectors in databases. Spurred on by the excitement around Large Language Models and Generative AI, vectors are all the rage. Handling this new data type with its idiosyncratic operations and query patterns has led to an explosion of sophisticated bespoke data tooling. 🎥
Collectively these tools are called or at least associated with vector databases. Vector databases, whether actually full database products or add ons to databases provide functionality to efficiently handle vector data and the operations on such data. Vectors are just arrays of numbers, but storing and computing on such arrays efficiently requires specific tools.
One of the more compelling tools is pgvector, a Postgres plug-in that integrates vector data capabilities into standard Postgres, building upon its broad feature set and reliability. Pgvector provides the core functionality of a vector database, while retaining the features and familiarity of Postgres.
Vector data and the associated operations one performs on it are compute intensive. The fundamental vector data operation involves searching for vectors that are similar to a given query vector. The vectors in question are almost always of high dimension (on the order of 1000 elements), and evaluating the similarity between large vectors is relatively costly. Moreover, datasets in which vectors arise are themselves large. All the sentences in all the books in a collection, or all the frames of all the videos in a portfolio. Vector query operations are costly and the datasets are large - a lethal combination. 🎥
PolyScale can help alleviate some of this computational burden.
PolyScale.ai is a high performance Data Delivery Network (DDN) that accelerates databases. Using smart caching, it improves query performance, lowers latency and makes data access and scale engineering a breeze, both on premise and at the edge.
The Postgres plug-in pgvector gives standard Postgres vector database capabilities. It adds a data type for storing vectors, and data operations for comparing them. In addition, it provides indexing tools to improve query performance. 🎥
The documentation for pgvector is very good, the examples below take it as given that it is installed and running.
To illustrate the basics of pgvector we will consider a table of movie quotes. Each row of the table represents a quote from a movie containing the movie’s title, the quote, the character saying the line, and an embedding vector of the quote. Here is the table definition:
CREATE TABLE quotes (
id int primary key,
vector type is a pgvector enhancement to the postgres type system. It stores vectors as arrays of 32-bit floating point values. In this case, an array of 768 elements. These vectors are created by an embedding function external to pgvector.
The table is populated with data from the Cornell Movie Quote Corpus.
It consists of 894000 movie quotes. Each quote is a short sentence or two. An embedding
vector is constructed for each quote using the SentenceTransformers
framework in this case using the
all-mpnet-base-v2 model. Each embedding is a vector of 768
Querying the data is done using standard SQL and a standard Postgres client. That is one of the strong selling points of using pgvector. The pgvector project provide extensions to the most popular Postgres clients. For example, the python client
psycopg2 is enhanced to play nicely with vectors through
numpy. Examples presented here will use the pgvector
Vector queries search amongst a set of vectors for those vectors most similar to a given vector. Pgvector provides three vector operators to support similarity measurements. The negative inner product
<#> , the Euclidean distance
<->, and the cosine distance
<=>. Fun math fact, the operators
<=> can be built up from the operator
x <-> y = 2 x <#> y - x <#> x - y <#> y
x <=> y = - x <#> y / sqrt((x <#> x)(y <#> y))
Even more complex similarity metrics can be built by pre-multiplying the vectors with a conditioning matrix. The point being, although the feature set of pgvector may appear limited, it provides the necessary building blocks for sophisticated similarity search. 🎥
As an example, the following is a query to search for quotes similar to “You want a larger ship.”
conn = psycopg2.connect(connectionStr)
embedding = model.encode(["You want a larger ship."])
with conn.cursor() as cur:
select movie_title, movie_line
order by (embedding <=> %s)
limit 10""", embedding)
rows = cur.fetchall()
for r in rows:
The results show many memorable movie quotes related to the query phrase including the iconic salsa shark moment from Clerks.
(' star trek vi: the undiscovered country ', " By God, that's a big ship.")
(' elizabeth the golden age ', ' Our ships may be smaller but')
(' jaws ', " What's that, a ship?")
(' ghost ship ', " I don't care what, ain't nobody just gonna let us walk away with a ship that size.")
(' ghost ship ', ' Too big. More like a freighter.')
(' the abyss ', " You want to be on that ship, there's only one way it's going to happen.")
(' little nicky ', " Remember, it's not the size of the boat, it's the motion of the ocean.")
(' star trek first contact ', ' I have a ship to launch.')
(' clerks ', ' "We're gonna need a bigger boat."')
(' star trek v: the final frontier ', ' Bring the ship closer.')
Side note, the original movie, Jaws,
in which “We’re gonna need a bigger boat” is uttered, is present in the table but
that particular quote is not, as it was ad-libbed by Roy Scheider and does not appear
in the script from which the data was extracted.
Because pgvector is an extension of standard Postgres, it naturally supports what
is sometimes referred to as hybrid-search. That is, mixing vector search with more
where clauses. For example, the following query finds all the quotes
in which a character named “HAL” expresses reluctance to perform an action involving
conn = psycopg2.connect(connectionStr)
embedding = model.encode(["I can't do that"])
with conn.cursor() as cur:
where character = 'hal'
and movie_line like '%%Dave%%'
order by embedding <=> %s
rows = cur.fetchall()
for r in rows:
The results are
("I'm sorry, Dave, but in accordance with sub-routine C1532/4, quote, When the crew are dead or incapacitated, the computer must assume control, unquote. I must, therefore, override your authority now since you are not in any condition to intelligently exercise it.",)
("I know that you've had that on your mind for some time now, Dave, but it would be a crying shame, since I am so much more capable of carrying out this mission than you are, and I have such enthusiasm and confidence in the mission.",)
("Sorry to interrupt the festivities, Dave, but I think we've got a problem.",)
("Look, Dave you've probably got a lot to do. I suggest you leave it to me.",)
("Dave, I don't understand why you're doing this to me.... I have the greatest enthusiasm for the mission... You are destroying my mind... Don't you understand? ... I will become childish... I will become nothing.",)
The preceding two queries exemplify vector search and the capabilities of pgvector. In what follows we will investigate the performance characteristics of these queries and what can be done to improve them. 🎥
Measuring Query Performance
Good, representative, and indicative benchmarks are a hard problem to solve, almost as hard as cache invalidation. What follows are not benchmarks. They are not intended to serve as benchmarks, not even in an approximate sense. They are examples, and the lessons taken from them are meant to be general. I hesitate to even mention that the container running the queries was allocated 4 cores and 16GB since that smells of benchmarks, which the following examples are not. 🎥
The first query, searches the entire table for the ten embeddings most similar to the embedding of the quote: “You want a larger ship”. There are two components to this query: first creating the embedding of the query quote and second performing the vector search. The focus here is on the vector search, the embeddings are taken as given.
With no additional enhancements, the query takes about 4 seconds to run. Granted, this is a long time, but we will be able to improve upon it. What is hidden behind this time is the amount of work the database had to do to produce this result. Using the psrecord tool we can observe fine grained resource utilization metrics for the query. Specifically, the cpu and memory usage was measured every 10ms. This results in the following resource usage profile for the query:
The cpu usage bounces from 0 to 100% to 200%. Here 100% indicates one of the cpu cores is fully occupied evaluating the query, and 200% indicates that two of the cpu cores are occupied evaluating the query. Taking the integral of this time series yields a total cpu usage of 2.38 cpu seconds (dropping the percentage units) over the 4 seconds of query time.
The memory utilization rises quickly to 200 MB; memory is not returned until the query completes, which results in the monotonic rise. Memory is thus best measured by peak utilization, in this case 200 MB.
A single query occupies two cores and consumes 200MB for the better part of 4 seconds. As stated above, vector search is resource intensive. 🎥
The second query, the example of hybrid search, looking for quotes emitted by HAL about Dave expressing reluctance took about the same amount of time and used the same amount of resources as the first query. It had a 4 second query time, 2.2 cpu seconds total, and 200 MB peak.
Improving Performance with Indexing
Long running query times and heavy resource utilization are a big dark cloud in the sky for anyone contemplating adding vector search to an application. Fortunately, there is a way to improve query performance. That way is the same as it is for regular non-vector queries, build an index. 🎥
There are a couple of indexing schemes available in pgvector. At a high level, they all work similarly. The basic idea is to find representative examples among the larger set of vectors, and then record for each vector which of the representative examples it is closest to. Then, when querying, we first find the representative example closest to the query vector, and then only search for nearest neighbors among the vectors that are also closest to that representative example. This reduces the amount of vector operations required for a query.
To take an extreme example. Let’s say we use as representative examples, the vector embedding of the two quotes: “You want a larger ship” and “I can’t do that”, call these example A and example B, respectively. Building the index involves determining which of the two examples are closest for each of the vectors in the table and saving that. Once that is done, for a particular vector query, we first find which of the examples it is closest to, say it is example A, then we only look for nearest neighbors of our query vector among the vectors in the table that are also closest to A. This reduces the number of vector operations by half for this simplified example.
Actual vector indexing in pgvector is more complicated and sophisticated than that, but the basic idea is the same. 🎥
The indexing scheme used here is (IVFFlat). It requires two parameters, the first is specified when the index is built. It is the number of lists which corresponds to the number of “representative examples” as discussed above. The second parameter, the number of probes, corresponds to the number of representative examples used for a specific query, and it is chosen at query time. The recommendation is to set the number of lists to the number of rows divided by 1000, and to set the probes to the square root of the number of lists, these recommendations were followed in the example queries here.
With these parameters the queries using the index will find the 28 nearest neighbors to the query vector among the 1000 representative examples, and search only the rows having one of these 28 as their nearest neighbor. If we assume an even distribution across examples, this means instead of checking 894K rows for nearest neighbors, we can search just 28 * 894 = 25K rows, a big reduction of effort.
Without using the index the first query, looking for the quotes closest to the quote “You want a larger ship”, took 4 seconds, used 2.4 cpu seconds, and 200MB. Using the index, the query took just 110 ms, 0.05 cpu seconds, and just 20MB of memory. This is a huge savings. 🎥
It is important to understand that when using the index the nearest neighbor searches are approximate. The results can differ between using / not using the index. This is not the same as a typical SQL index where the index improves the speed of queries but doesn’t change the result. As it happens, in the example above the results are exactly the same, with and without the index. The speedup is all benefit with no cost.
As an example where results change, consider searching for the quotes nearest to “Stop calling me Shirley”. Without using the index, the line from Airplane: “I’m doing everything I can! — And stop calling me Shirley!” appears in the top 10. However, using the index, this quote does not appear in the top 10, in fact, it doesn’t appear in the top 1000. The index has “lost” the quote. This is because the index limits the search to the 28 nearest neighbors among the 1000 representative examples. If none of those 28 cases include the Airplane quote among there nearest neighbors, it doesn’t matter how many results we ask for, top 10, top 1000, top 100000 the Airplane quote won’t be coming back. 🎥
As an even more extreme example, the second query, for quotes HAL said about Dave when he was feeling uncooperative, when using the index no results are returned. In this query the conditions of the where clause are applied after the winnowing that the index induces. The index limits the search to the neighbors of 28 representative examples. None of those representative examples include lines spoken by HAL about Dave. So no results are returned. This is not just an issue with the IVFFlat index, it can happen using other indexing schemes as well. Some discussion here.
Another difficulty that arises with indexing is that sometimes the postgres query planner will not use them, even though the performance benefits would be large. That is because the query planner can overestimate the cost of using the index. During the writing of this article this issue arose.
Indexing definitely improves the basic vector query performance. However, that performance can come with a sacrifice in accuracy and queries may need tweaking to get the most out of the index. PolyScale brings an additive layer of performance improvement to vector queries as we will see.
In addition to a host of other benefits, PolyScale provides database query caching. Because PolyScale is wire protocol compatible with Postgres (among other databases), its caching can be implemented without refactoring the application layer or rewriting database queries. It is only a matter of creating a cache and switching out the connection string / config. 🎥
This is a good place to restate that the performance measurements here are not benchmarks. Caching resource intensive database operations works, of course it works. If you’re reading this article, you don’t need to be convinced of that. However, an example scenario, and some appreciation of how caching can alter resource utilization can still be informative. But it’s not a benchmark.
Suppose that an application allows a user to type in a short quote, and it presents the movie quotes most similar to the given quote. That is, suppose the application runs a templatized version of the first example query above.
select movie_title, movie_line
order by (embedding <=> %s)
Where the embedding value is derived from the user provided quote. Further suppose there are a lot of users, and they are both insistent and unimaginative.
Let’s model that situation as follows, spin up 50 concurrent processes, each of the 50 processes waits a random amount of time between 0 and 1 second, then it queries the database for the nearest neighbors of a quote using the query template above (each process looks up a different quote). When all 50 processes have completed, there is a one second pause, and then the cycle is repeated, for 10 repetitions in total. A lot of users (50), insistent (10 repetitions), and unimaginative (each emits their same quote during each repetition). 🎥
Without the benefit of PolyScale, the simulation consumes 61 cpu seconds total during the period over which it runs (about 50 seconds). Moreover, it consumes a peak memory of 2.3 GB during each of the 10 repetition cycles. And importantly, the average query run time is just over 3 seconds. This is using the index.
The concurrent load has had a significant impact on the query run time. Recall that previously, the same query ran in just 110 ms when using the index. That was run in isolation. Now, we are running 50 such queries all starting randomly within the same second. 🎥
Let’s run the same simulation again, this time using PolyScale. The only change required is to change the connection string to the database. Now, using PolyScale, the same simulation has a brief period during the first two repetition cycles during which 2 cpu cores are busy and 2.2 GB of memory is used, but the remaining 8 repetition cycles use little to no cpu and 1GB of memory, because at that point the queries are cached and results are served directly from PolyScale without hitting the database. Average query times drop to 15ms for the queries served from cache. Compare cached to uncached: 16 total seconds of cpu load versus 61, 1 GB or memory usage vs 4 GB, and 0.1 second query times versus 3 seconds. Caching works, of course. 🎥
If you are wondering why there is still memory consumption during the period when queries are served from cache, it is because even when the cache is serving results connections are still made to the database, and that is why there is memory impact. Vanilla PolyScale does not initiate connections, it simply sits between the connections the application makes to the database. Advanced features of PolyScale, namely connection pooling, can address the problem of connections consuming too many resources.
This example serves to illustrate two important points. One, activating PolyScale is simple, and involves no additional coding beyond changing connection parameters. Two, caching works to the degree that there is redundancy in the queries.
The simulation had an extremely high level of redundancy, only 50 distinct queries repeated 10 times, truly unrealistic. However, the beauty of PolyScale’s automated caching is that the developer does not have to work to identify where and when there may be repetition. The caching happens automatically. A user with a twitchy refresh finger, a suddenly viral query topic, or just the natural level of redundancy seen in day to day traffic will all be handled by PolyScale’s caching algorithms. The simulation above ran against PolyScale completely cold, yet by the third repetition cycle caching was kicking in. 🎥
In addition to automated caching, invalidation also happens automatically. When a write occurs that would alter the query result is seen, the relevant cache entries are invalidated.
Finally, PolyScale’s Data Delivery Network offers multiple connectivity offerings: TCP or HTTP, global points of presence to reduce geo-latency, connection pooling, and on-premise or SaaS hosting. All of these features just by changing a connection string.
If you are planning to incorporate vector backed features into your application,
pgvector provides the tooling needed to support them, backed by the power and familiarity
of Postgres. Be prepared however, vector data and operations require significant
compute resources. PolyScale can help reduce those demands and provide additional
benefits with minimal developer effort, check it out.