Announcing CDC Support for Global Cache Invalidations

Ben Hagan
Oct 30, 2023

We are pleased to announce the release of our global Change Data Capture (CDC) consumer architecture, powering real-time invalidations with PolyScale. Customers can now use database CDC streams to directly power cache invalidations, enabling real-time invalidation in scenarios where updates to the database happen out of band of PolyScale. CDC-based invalidations also provide high accuracy for specific query patterns - more on this below.

PolyScale is an AI-driven, distributed cache for Postgres, MySQL, MariaDB, MS SQL and MongoDB. Using smart caching and a global Data Delivery Network (DDN), it reduces database load, improves query performance and lowers network latency, both on-premises and at the edge.

Caching database query results improves performance at both the application layer and the database. Queries are faster because results are available right next to the application, and the database has less work to do. What’s not to love? But caches must be invalidated when the underlying data changes; otherwise, the results served from the cache are still served quickly, but they no longer represent the ground truth of the database.

Caching within PolyScale has multiple invalidation mechanisms. CDC streams provide a new and highly effective method to power cache invalidations.

CDC and Cache Invalidation

CDC captures database changes and streams them in a coherent and consistent way to a separate system - in this case, PolyScale. Consider, for example, a write such as:

UPDATE t1 SET c1 = 'x', c2 = 'y' WHERE c3 = 'z';

A message will be sent to the CDC stream detailing which rows changed, what values they held prior to the change, and what values they hold after it. For example, the update statement above might generate the following message in the CDC stream:

    "before": { "c1": "a", "c2": "b", "c3": "z" },
    "after": { "c1": "x", "c2": "y", "c3": "z" }
    "before": { "c1": "a", "c2": "c", "c3": "z" },
    "after": { "c1": "x", "c2": "y", "c3": "z" }

This indicates that two rows were affected by the update. The PolyScale caching layer now knows it should invalidate cache entries for queries that involve the relevant values of columns c1, c2, and c3.
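To make this concrete, here is a minimal sketch (hypothetical, not PolyScale's actual implementation) of a cache that indexes entries by the (column, value) predicates their queries filtered on, and invalidates any entry whose predicate appears among the before or after values of a CDC message:

```python
# Hypothetical sketch: cache entries keyed by the (column, value)
# predicate the cached query filtered on.
cache = {
    ("c3", "z"): "result of SELECT * FROM t1 WHERE c3 = 'z'",
    ("c1", "w"): "result of SELECT * FROM t1 WHERE c1 = 'w'",
}

def invalidate(cache, cdc_message):
    """Drop cached results whose predicate matches any before/after value."""
    touched = set()
    for row in cdc_message:
        for image in (row["before"], row["after"]):
            touched.update(image.items())  # (column, value) pairs
    for key in list(cache):
        if key in touched:
            del cache[key]

# The two-row message generated by the UPDATE above:
message = [
    {"before": {"c1": "a", "c2": "b", "c3": "z"},
     "after":  {"c1": "x", "c2": "y", "c3": "z"}},
    {"before": {"c1": "a", "c2": "c", "c3": "z"},
     "after":  {"c1": "x", "c2": "y", "c3": "z"}},
]
invalidate(cache, message)
# The c3 = 'z' entry is dropped; the c1 = 'w' entry survives, since no
# before or after value of c1 was 'w'.
```

Real systems index far richer predicates (multi-column filters, joins, ranges); the point is only that before and after images allow exact matching by value.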

CDC-based invalidation provides highly targeted and effective information for cache invalidation. Other mechanisms, such as PolyScale’s Smart Invalidation, parse DML (write) queries to deduce what should be invalidated. But the query text only tells part of the story (the after component of the message above), and so invalidation must be done more broadly. For example, suppose the result of the query

SELECT * FROM t1 WHERE c1 = 'w'

were cached. Based on the CDC messages, the update above would not invalidate it, since neither the before nor the after value of c1 was equal to w. However, based solely on the update query, without access to the CDC messages, the result would be invalidated: there is no knowledge of the before values, and so no way to tell whether c1 = 'w' held prior to the update.
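The asymmetry can be sketched in code. A query-text-only invalidator (a hypothetical sketch of the conservative logic, not PolyScale's actual parser) sees from the UPDATE only the written columns (c1, c2) and the predicate (c3 = 'z'); lacking before images, it must drop every cached predicate on a written column regardless of value:

```python
def conservative_invalidate(cache, written_cols, predicate):
    # Without before images, a written column could previously have held
    # any value, so every cached predicate on it must be dropped; the
    # predicate column of the UPDATE can still be matched by value.
    for (col, val) in list(cache):
        if col in written_cols or (col, val) == predicate:
            del cache[(col, val)]

cache = {("c1", "w"): "result of SELECT * FROM t1 WHERE c1 = 'w'"}
conservative_invalidate(cache, written_cols={"c1", "c2"},
                        predicate=("c3", "z"))
# cache is now empty: c1 = 'w' cannot be ruled out without before images.
```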

When used in conjunction with PolyScale, CDC enables highly efficient cache invalidation.

CDC in Action

To see how this works in practice, we’ll consider a query pattern we notice among some of our customers, one that can lead to less-than-ideal invalidation when relying on query-based PolyScale Smart Invalidation.

Consider a parameterized read query:

SELECT id, ts, src, data FROM t1 WHERE src = $1;

Imagine these reads are repeated several times for each of several values of the parameter $1. The repeated reads all get cached, and performance is spectacular.

Now add writes to these reads, using this query:

UPDATE t1 SET ts = now() WHERE id = $1;

The pattern here is that the application reads data by one identifier (in this case src) and updates it in the database using a different identifier (in this case id). Using PolyScale Smart Invalidation, this update must invalidate all of the cached read queries, regardless of which value of src was specified: based only on the query text, no connection can be made between the individual reads specifying src and the write specifying id.

Using CDC, the connection between src and id is revealed. The write query above will send a message of the form:

   before: { id: 12345, ts: '2023-09-20 …', src: 'x45562', data: … },
   after:  { id: 12345, ts: '2023-09-26 …', src: 'x45562', data: … }

and the PolyScale DDN will know it only needs to invalidate the cache entries corresponding to the query that specified src = 'x45562'.

To see CDC invalidation in action, consider the following experiment. Assume a table t1 exists with columns: id, src, ts. There are 5 unique values of src, and for each value of src there is a unique value of id. The timestamp column, ts, represents the time some action was taken for the given id, src pair (in practice, there would be some additional data columns, but for this example, the “data” is the timestamp).

We query this database repeatedly. On each iteration, with probability 90% we select a src value at random and read the rows for that value of src; with probability 10% we update the timestamp for an id chosen at random. This represents a typical read-heavy workload that reads by one identifier and writes by a different one.
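This workload is easy to simulate. The toy model below (the 10,000-iteration count and random seed are assumptions, not parameters from our experiment) compares a "smart" strategy, which must drop every cached entry on each write, with a "cdc" strategy, which uses the before image to drop only the entry for the matching src:

```python
import random

SRCS = [f"s{i}" for i in range(5)]           # 5 unique src values
SRC_OF = {i: s for i, s in enumerate(SRCS)}  # one id per src

def run(strategy, iters=10_000):
    random.seed(42)              # identical workload for both strategies
    cached = set()               # src values with a cached read result
    hits = reads = invalidations = 0
    for _ in range(iters):
        if random.random() < 0.9:    # 90%: read by src
            src = random.choice(SRCS)
            reads += 1
            hits += src in cached
            cached.add(src)
        else:                        # 10%: write by id
            row_id = random.randrange(5)
            if strategy == "smart":
                invalidations += len(cached)  # no id -> src link known:
                cached.clear()                # drop every cached entry
            else:                             # "cdc": before image gives src
                src = SRC_OF[row_id]
                invalidations += src in cached
                cached.discard(src)
    return hits / reads, invalidations

smart_hit_rate, smart_invalidations = run("smart")
cdc_hit_rate, cdc_invalidations = run("cdc")
```

Under this model, the CDC strategy sustains a much higher hit rate with far fewer invalidations.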

We are interested in the steady-state behavior of the system. Running the experiment with PolyScale Smart Invalidation versus CDC-based invalidation, we see that hit rates are much better with CDC-based invalidation. The reason, as expected, is that Smart Invalidation performs far more invalidations.


Try It Out

At this time, the PolyScale CDC Engine is in beta release. If your database architecture makes sending DML (write) queries through PolyScale inconvenient, or if, due to query complexity, PolyScale Smart Invalidation is resulting in unnecessary invalidations, CDC invalidation may be right for you. You can read more about using Debezium for CDC with PolyScale in the documentation here, and reach out to us for an opportunity to try it out.
