Approaching Cache Invalidation with PolyScale.ai
Matt Calder
Jun 26, 2023

Cache invalidation is famously considered one of Computer Science's hardest problems. Like most difficult problems, there is no one-size-fits-all solution. PolyScale.ai performs cache invalidation in a number of different ways. The following discusses why cache invalidation is difficult and the ways PolyScale approaches invalidation.
PolyScale.ai is a serverless, plug-and-play database cache. It dramatically accelerates database read performance, lowers network latency, and reduces database infrastructure costs. PolyScale requires no code to implement and zero configuration or tuning; it can be implemented in minutes with a simple configuration change. PolyScale offers a global edge network SaaS platform as well as a self-hosted option.
Background
PolyScale is a fully automated database cache. It runs as a proxy between the application and the database. Every query the application sends to the database is a candidate for caching. The time to live (TTL) of cached entries is automatically determined and dynamically adjusted in real-time. In addition to algorithmically populating the cache, PolyScale also automatically invalidates cache entries that are deemed to be stale, that is, cache entries that no longer reflect the current state of the database. This invalidation occurs globally, across any PolyScale points of presence that have seen the relevant query.
With PolyScale, every query that is cacheable is cached. Cached items are stored until they should no longer be served, and are then invalidated automatically.
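To give a concrete sense of what plug-and-play means here, the sketch below shows an existing Postgres client being pointed at a caching proxy instead of the database host. The endpoint, credentials, and driver choice are illustrative placeholders, not actual PolyScale values; consult the PolyScale documentation for real connection details.

```python
# A minimal sketch of routing an existing Postgres client through a caching
# proxy instead of the database host. The endpoint and credentials below are
# placeholders, not actual PolyScale values.
import psycopg2

# Before: connect directly to the database
# conn = psycopg2.connect(host="db.example.com", port=5432,
#                         dbname="app", user="app", password="secret")

# After: connect through the caching proxy; the application code is unchanged
# and every query it sends becomes a candidate for caching.
conn = psycopg2.connect(
    host="cache.example-proxy.net",   # placeholder proxy endpoint
    port=5432,
    dbname="app",
    user="app",
    password="secret",
)

cur = conn.cursor()
cur.execute("SELECT c1 FROM t1 WHERE c2 = 1")   # may be served from cache
print(cur.fetchone())
```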
Contrast this approach with a traditional caching layer implementation. In a traditional approach, a key-value store such as Redis serves as the cache persistence layer. Developers decide which queries to cache. Code is written to derive appropriate cache keys for the chosen queries, and more code connects the ORM or other query-generating layer to the key-value store. TTLs are chosen based on the developers' intuition and understanding of query traffic and are hard-coded into the application. Invalidation, if it is done at all, involves matching every way the application might change the data against every query chosen to be cached.
There are many ways this manual approach can go wrong. The complexity of computing a cache key grows with the complexity of the underlying query, and cache key aliasing (the same key produced for different queries) is a frequent source of error. Static TTLs are a compromise: they may work well at one time and be too short or too long at others. Every time a new query is added, the data write path is modified, or the database schema changes, the entire cache layer must be reconsidered. No wonder caching is considered hard.
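To make the contrast concrete, here is a minimal sketch of the hand-rolled pattern described above, using Redis as the key-value store and assuming a DB-API style `db` connection object. The key derivation, static TTL, and invalidation hook are illustrative choices, not recommendations.

```python
# A minimal sketch of a hand-rolled Redis caching layer. Key derivation,
# the static TTL, and invalidation all have to be written and maintained
# by hand.
import hashlib
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def cache_key(sql: str, params: tuple) -> str:
    # Naive key derivation: equivalent queries written differently produce
    # different keys, and subtle normalization bugs cause key aliasing.
    raw = sql + "|" + json.dumps(params, default=str)
    return "q:" + hashlib.sha256(raw.encode()).hexdigest()

def cached_query(db, sql: str, params: tuple, ttl_seconds: int = 60):
    key = cache_key(sql, params)
    hit = r.get(key)
    if hit is not None:
        return json.loads(hit)               # cache hit
    cursor = db.cursor()
    cursor.execute(sql, params)              # cache miss: go to the database
    rows = cursor.fetchall()
    # Static TTL chosen by intuition; too short wastes hits, too long
    # risks serving stale data.
    r.set(key, json.dumps(rows, default=str), ex=ttl_seconds)
    return rows

def invalidate_t1_queries():
    # Manual invalidation: the developer must remember every cached query
    # that reads table t1 and delete its key whenever t1 is written.
    pass
```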
PolyScale performs all of these tasks automatically. There are three methods PolyScale uses to determine when a cached item should be invalidated. Each method approaches the invalidation problem with different information and a different outcome. These different approaches are needed because, fundamentally, cache invalidation is hard.
The Problem with Invalidation
Cache invalidation is difficult because it involves a classic exploration/exploitation trade-off. Delivering data from the cache is fast and reduces load on the database. Ideally, all query results would be delivered from the cache: the application would receive results in the least amount of time and the database would do the least amount of work. However, over time the database data can change, and a cached query result might no longer be valid. The only way to discover this is to bypass the cache and send the query to the database.
It is best to deliver data from the cache until the cache result is no longer valid, but the only way to discover that the result is not valid is to get the result from the database.
Note that PolyScale works as a proxy. It does not generate queries on its own - these are always initiated from the client. This constraint may be relaxed in the future, making the invalidation problem easier but still based on the approaches discussed here.
There are two errors one might make when invalidating: not invalidating soon enough, or invalidating too soon. When data changes and a cached result is no longer valid, it must be invalidated. If this is not done and the invalid data is served to the application, that is a type-1 error. On the other hand, if a cached result is invalidated but the underlying data hasn't changed, a cache hit opportunity has been lost; that is a type-2 error.
| | Data Unchanged | Data Changed |
|---|---|---|
| Cache Hit | Best outcome | Type-1 error |
| Cache Miss | Type-2 error | Correct outcome |
The loss function for caching is typically one-sided: type-1 errors are much worse than type-2 errors. It is better to allow ten cache misses that are type-2 errors than to serve one cache hit that is a type-1 error. This prioritizes correctness over performance. There may be exceptions to this rule, but in general cache invalidation is the art of balancing the risk of committing a type-1 error against the cost of committing a type-2 error, under the constraint that the only way to observe a change is to invalidate the cache.
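A toy calculation makes the one-sided loss concrete. Assuming a stale hit costs roughly ten times a lost hit, serving from cache only makes sense while the estimated probability that the data has changed stays small. The costs and probabilities below are illustrative, not measurements.

```python
# Toy expected-loss comparison for one cached query, under an assumed 10:1
# cost ratio between a type-1 error (stale hit) and a type-2 error (lost hit).
TYPE1_COST = 10.0   # assumed cost of serving one stale result
TYPE2_COST = 1.0    # assumed cost of one unnecessary cache miss

def loss_if_served(p_changed: float) -> float:
    # Serving from cache only hurts if the data has actually changed.
    return p_changed * TYPE1_COST

def loss_if_invalidated(p_changed: float) -> float:
    # Invalidating only hurts if the data has not changed.
    return (1.0 - p_changed) * TYPE2_COST

for p in (0.01, 0.05, 0.09, 0.20):
    serve, drop = loss_if_served(p), loss_if_invalidated(p)
    choice = "serve from cache" if serve < drop else "invalidate"
    print(f"P(changed)={p:.2f}  serve={serve:.2f}  invalidate={drop:.2f}  -> {choice}")

# With these costs the break-even point is P(changed) = 1/11, roughly 0.09:
# above that, invalidating is the cheaper (safer) choice.
```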
The ability to diminish or remove the likelihood of either type of error depends on what information is available. The approaches to cache invalidation PolyScale employs can be organized around an increasing amount of auxiliary information.
Statistical Invalidation
One approach to resolving the tension between serving hits (to deliver value) and allowing misses (to discover whether the data has changed) is to estimate when the data might have changed using the pattern of queries seen so far.
Using the patterns of the past to predict the future is Statistics, and applying that to the problem of cache invalidation is Statistical Invalidation.
There are quite a few observations available for making a statistical determination of when a query's result might change. The past rate of change for the query is one of the primary observations. Unfortunately, that only helps for queries that are repeated continuously, and only after observing many instances of the query and its result. A more timely estimate of when a specific query's result might change comes from looking at the rates of change of related queries. This allows learnings to be transferred from one query to another.
Another important component of the statistical model used to estimate the data change rate is dynamic flexibility. A given query's result can change frequently at one point in time and slowly at another. The data change rate is estimated in a way that allows for this variation.
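As a rough illustration of the idea (and not PolyScale's actual algorithm), the sketch below tracks the time between observed result changes with an exponentially weighted average and derives a TTL as a conservative fraction of that interval, so the TTL adapts as the change rate speeds up or slows down.

```python
# A deliberately simplified sketch of statistically derived TTLs: estimate how
# often a query's result changes and cache for only a fraction of that interval.
class ChangeRateEstimator:
    def __init__(self, alpha: float = 0.3, safety_factor: float = 0.25):
        self.alpha = alpha                  # weight given to the newest observation
        self.safety_factor = safety_factor  # how far inside the change interval to set the TTL
        self.avg_interval = None            # smoothed seconds between observed result changes
        self.last_change_at = None

    def observe(self, now: float, result_changed: bool) -> None:
        if not result_changed:
            return
        if self.last_change_at is not None:
            interval = now - self.last_change_at
            if self.avg_interval is None:
                self.avg_interval = interval
            else:
                # Exponentially weighted moving average: recent behavior dominates,
                # so the estimate tracks speed-ups and slow-downs in the change rate.
                self.avg_interval = (self.alpha * interval
                                     + (1 - self.alpha) * self.avg_interval)
        self.last_change_at = now

    def ttl_seconds(self) -> float:
        if self.avg_interval is None:
            return 1.0                      # cautious default before any evidence
        # Cache for only a fraction of the expected change interval, trading
        # some type-2 errors for a lower chance of a type-1 error.
        return self.safety_factor * self.avg_interval
```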
One important use of Statistical Invalidation, relied upon by all of PolyScale's invalidation methods, is detecting when a query's results are changing in unpredictable ways. When that determination is made, caching of that query is disallowed until it becomes predictable again. In this way, Statistical Invalidation provides an ever-present safety net.
Statistical Invalidation can reduce but not eliminate type-1 errors. Erring on the side of caution requires a tolerance for committing type-2 errors, which can lead to excessive invalidation. On balance, relative to the other invalidation methods described below, Statistical Invalidation has the highest type-1 and type-2 error levels. If an application has a low tolerance for type-1 errors and/or wants to minimize type-2 errors, a better approach relying on additional data is required.
DML Based Invalidation
Data within the database does not change spontaneously. It is changed deliberately, using the same mechanism used to read it: queries. In SQL the main queries that alter data are INSERT, UPDATE, and DELETE. Queries that write data are collectively referred to as DML (Data Manipulation Language) queries. In contrast, queries that simply read data (for example SELECT or SHOW) are referred to as DQL (Data Query Language) queries.
DML Invalidation works by analyzing the syntactic structure of all queries and associating those DML queries that change specific data with the DQL queries that read that data.
Often both the writes (DML) and reads (DQL) are performed by the same application. That application can then send both types of queries to PolyScale. When both DML and DQL are sent to PolyScale, a new level of invalidation can occur: DML Smart Invalidation.
As an example, suppose one is caching a DQL query:
SELECT c1 FROM t1 WHERE c2 = 1
If the DML query:
UPDATE t1 SET c1 = 'a' WHERE c2 = 1
is observed, then the cached result of the DQL query must be invalidated, because the result of the query may have changed. On the other hand, suppose this DML query is observed:
UPDATE t1 SET c1 = 'a' WHERE c2 = 2
then the cached result of the DQL query can be retained. This second DML query cannot change the result of the DQL query because the conditions of the two WHERE clauses do not overlap.
The syntactic structure of a query considerably narrows the range of data that is altered by DML queries and read by DQL queries, allowing for more targeted invalidation. This narrowing occurs at multiple levels: the tables, columns, and rows being written and read, as sketched below.
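The sketch below is a deliberate simplification (nothing like a full SQL parser), but it illustrates the matching idea: each query is reduced to the table it touches plus simple equality predicates, and a cached entry is kept only when the write and the read can be proven disjoint.

```python
# A highly simplified sketch of matching a DML footprint against cached DQL
# footprints. Real smart invalidation works on full SQL parse trees; here a
# query is just a table name and a column -> required-value map.
from dataclasses import dataclass, field

@dataclass
class QueryFootprint:
    table: str
    predicates: dict = field(default_factory=dict)   # column -> required value

def predicates_can_overlap(write: QueryFootprint, read: QueryFootprint) -> bool:
    if write.table != read.table:
        return False
    for column, value in write.predicates.items():
        if column in read.predicates and read.predicates[column] != value:
            # Both queries constrain the same column to different values,
            # so the write cannot touch any row the read returns.
            return False
    # Otherwise we cannot prove the footprints are disjoint: assume overlap.
    return True

# Cached query: SELECT c1 FROM t1 WHERE c2 = 1
cached = QueryFootprint(table="t1", predicates={"c2": 1})

# UPDATE t1 SET c1 = 'a' WHERE c2 = 1  -> overlaps, must invalidate
print(predicates_can_overlap(QueryFootprint("t1", {"c2": 1}), cached))   # True

# UPDATE t1 SET c1 = 'a' WHERE c2 = 2  -> provably disjoint, keep the entry
print(predicates_can_overlap(QueryFootprint("t1", {"c2": 2}), cached))   # False
```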
DML Invalidation is done globally, against any PolyScale caches that access the same data. Organizing and acting on this information automatically and efficiently across millions of queries globally is where DML Invalidation becomes DML Smart Invalidation.
DML Invalidation, even at its smartest, has limitations. It is not always possible to narrow the scope of data affected by a DML query and/or read by a DQL query enough to avoid unnecessary invalidation. For example, a common idiom found in applications is write-by-condition, read-by-id:
UPDATE t1 SET c1=1 WHERE c2 = 2 AND c3 = 'a'
SELECT c1,c2,c3 FROM t1 WHERE cId = 123
In this case, the DML must be assumed to invalidate the DQL query even though the cached result might remain valid. That is because it is impossible to say for certain that the query conditions do not overlap. PolyScale will always invalidate if there is any possibility that a given DML query might invalidate a given DQL query.
DML Invalidation is highly accurate, and type-1 errors are virtually non-existent. DML Invalidation also has fewer type-2 errors than pure Statistical Invalidation, but it still has some: it cannot eliminate type-2 errors because it must invalidate in the absence of certainty. Reducing type-2 errors further requires an even more detailed level of data change visibility.
CDC Invalidation
Change Data Capture (CDC) is a built-in feature of modern database systems that tracks the changes that occur in the database and streams a representation of those changes to consumer processes. CDC forms the basis for building read replicas and provides exact and targeted information for doing cache invalidation.
CDC data provides details about any and all changes to individual rows of data within the database.
CDC is complicated, and the actual implementation depends on the specific database involved. PolyScale supports CDC streams and relies on the highly regarded open source platform Debezium to capture changes and manage the resulting streams. Debezium provides a uniform representation of data changes across all the database platforms PolyScale supports.
Specifically, CDC events call out the values of a changed row before and after the change. For example, consider the DML and DQL queries from before:
UPDATE t1 SET c1=1 WHERE c2 = 2 AND c3 = 'a'
SELECT c1,c2,c3 FROM t1 WHERE cId = 123
The DML query might generate the following CDC event:
{
  table: t1,
  before: {cId: 101, c1: 0, c2: 2, c3: 'a'},
  after:  {cId: 101, c1: 1, c2: 2, c3: 'a'}
}
Note that this is more information than can be gleaned from the DML query itself. As described above, based on the DML query alone the cached DQL result must be invalidated. However, the CDC event reveals that invalidation is not actually necessary, because the row that changed has a different cId value from the one the DQL query reads.
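A simplified sketch of this check follows. Real Debezium events carry more structure and PolyScale's matching is far more general, but the core idea is to evaluate the cached query's predicate against the before and after row images.

```python
# A sketch of row-level CDC invalidation: keep a cached entry if neither the
# old nor the new version of the changed row satisfies the cached query's
# WHERE clause (represented here as a column -> value map).
def row_matches(row: dict, predicates: dict) -> bool:
    return all(row.get(col) == val for col, val in predicates.items())

def cdc_requires_invalidation(event: dict, table: str, predicates: dict) -> bool:
    if event["table"] != table:
        return False
    before, after = event.get("before"), event.get("after")
    # If either row image satisfies the predicate, the cached result may
    # have changed and must be invalidated.
    return any(row is not None and row_matches(row, predicates)
               for row in (before, after))

event = {
    "table": "t1",
    "before": {"cId": 101, "c1": 0, "c2": 2, "c3": "a"},
    "after":  {"cId": 101, "c1": 1, "c2": 2, "c3": "a"},
}

# Cached query: SELECT c1, c2, c3 FROM t1 WHERE cId = 123
# The changed row has cId = 101, so the cached entry can be retained.
print(cdc_requires_invalidation(event, "t1", {"cId": 123}))   # False
```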
CDC Invalidation gives the best of both worlds: almost zero type-1 errors and the lowest type-2 error rates. The downside is that generating the CDC information puts additional load on the database. Database systems expend great effort to minimize the cost of CDC, but it will never be zero.
Latency
One topic not yet covered is latency. When it is decided that a cached entry is no longer valid and needs to be invalidated, it takes time for the cached result to be removed. The different approaches to invalidation all have different latencies.
Statistical Invalidation has zero latency, because the invalidation is embedded in the TTL; while the invalidation can happen too late or too soon, it always happens exactly when scheduled.
A type-1 error occurs when a cached query result is served that does not reflect the current state of the database. By similar reckoning, the latency of an invalidation is the amount of time a stale cached result might still be served after an observed change occurs at the database. Note that in the case of an error the latency is unbounded, and therefore uninteresting. Statistical Invalidation can be wrong, but it doesn't have latency in the sense meant here.
The latency clock starts ticking when the database data changes. In the case of DML Invalidation, the invalidation process is already well underway by the time that happens. DML queries pass through PolyScale, which simultaneously forwards them to the database and begins the invalidation process. Depending on the network topology, the DML cache invalidation can actually occur before the database change, resulting in zero latency. In general it's a tight race, and latencies for DML Invalidation are low but not zero.
CDC Invalidation, on the other hand, originates at the database, and CDC events necessarily occur after the database data has changed. Those events must then find their way back to the PolyScale cache, which adds additional time. CDC Invalidation latencies are highly dependent on database load and on where the database is located relative to the PolyScale cache.
Both DML and CDC Invalidations are propagated globally, and that global propagation adds another layer of latency. In the case of DML Invalidation, the latency corresponds to the network distance between the PolyScale cache that saw the original DML query and any other caches holding affected entries. In the case of CDC Invalidation, all invalidations originate at the database and are issued simultaneously; their individual latencies depend on the network distance between the database and each PolyScale cache.
Conclusion
Cache invalidation is hard. In the presence of imperfect information, the risk of a type-1 error can be reduced either by accepting a higher chance of type-2 errors or by acquiring better information. The table below summarizes the trade-offs:
| Invalidation | Type-1 Error Rate | Type-2 Error Rate | Additional Data | Latency |
|---|---|---|---|---|
| Statistical | Low | Low | None | None |
| DML | Near zero | Low | DML Queries | ~10ms |
| CDC | Near zero | Lowest | CDC Data | ~100ms |
If an application's architecture permits it, DML or CDC Invalidation provides the most comprehensive and accurate level of invalidation. DML Invalidation provides the fastest turnaround between data change and invalidation. CDC Invalidation provides the most targeted invalidation and maximizes hit rates. If DML queries cannot be routed through PolyScale and CDC is not an option, Statistical Invalidation can offer an acceptable solution on its own.
PolyScale provides automated caching with sophisticated multi-tier cache invalidation out of the box. Sign up for your own free PolyScale account here (no credit card required). Connect your database to a cache and start delivering your data around the globe, faster. Or try out PolyScale without bringing your own database with our live interactive demo playground.