A recent paper explores an interesting alternative to LRU as a cache data structure. The algorithm described in the paper is motivated by scaling laws for the distribution of one-hit-wonders (one hits): items in the stream of cacheable items that are observed only once. One hits are a bane to caching algorithms, since caching an item that never repeats is wasted work and space. The paper shows how the occurrence of one hits scales with the number of items processed by the cache. The paper is well written and explains the work in detail, so I won't repeat it here. Instead, I offer a small mathematical homage to their work.
PolyScale.ai is a serverless, plug-and-play database cache. It dramatically accelerates database read performance, lowers network latency, and reduces database infrastructure costs. PolyScale is plug-and-play, requiring no code to implement and zero configuration or tuning; it can be implemented in minutes with a simple configuration change. PolyScale offers a global edge network SaaS platform as well as a self-hosted option.
The setting in which the analysis of one hits occurs is as follows. Assume there is a set of m symbols, representing the cacheable items. Further, assume there is a sequence of these symbols of length n, representing the flow of data into the cache. The question of interest examined here is how many of the symbols occur only once in the sequence; that is, how many one hits there are.
To make things a little more concrete before wading into the mathematical pool, imagine the symbols correspond to distinct SQL queries, and that the sequence is the sequence of queries coming from an application and arriving at a PolyScale database cache point of presence. Any time taken processing or space used storing a query that occurs once in the sequence is time and space wasted.
The algorithm described in the paper does a great job avoiding this waste, and is able to do so by accounting for the way such one hits are distributed, given the number of symbols m and the length of the sequence n. Let the set of m symbols be:

    S = {1, 2, ..., m}

And let the sequence of symbols be:

    X_1, X_2, ..., X_n

where each symbol in the sequence is independent and identically distributed with a Zipf distribution with tail index α:

    P(X_i = k) = k^(-α) / H(m, α),   where   H(m, α) = Σ_{j=1..m} j^(-α)
The Zipf distribution provides a realistic approximation to the distribution of items flowing into a cache across a wide variety of real life data sources (as demonstrated in the paper). It also has a unique stability property which is exploited in the analysis of one hits that follows.
Counting One Hits
Denote the count of symbol k in an (m, n)-sequence (a sequence of length n constructed from m unique symbols) as

    c_k = Σ_{i=1..n} 1{X_i = k}

Then the number of one hits is

    O(m, n) = Σ_{k=1..m} 1{c_k = 1}

that is, the sum over symbols of the indicator that the symbol occurs exactly once. The expectation of the number of one hits is the quantity we will be computing:

    μ(m, n) = E[O(m, n)]
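Before deriving anything, this expectation can be sanity-checked by simulation. Here is a small Monte Carlo sketch (the helper names are mine, not from the paper): draw a length-n sequence of m symbols iid Zipf(α), count the symbols that appear exactly once, and average over many trials.

```python
import numpy as np

rng = np.random.default_rng(0)

def zipf_probs(m, a):
    # Zipf(a) probabilities over symbols 1..m (illustrative helper)
    w = np.arange(1, m + 1) ** (-a)
    return w / w.sum()

def mc_one_hits(m, n, a, trials=2000):
    # Monte Carlo estimate of the expected number of one hits
    p = zipf_probs(m, a)
    total = 0
    for _ in range(trials):
        seq = rng.choice(m, size=n, p=p)          # one length-n sequence
        counts = np.bincount(seq, minlength=m)    # per-symbol counts
        total += int(np.sum(counts == 1))         # symbols seen exactly once
    return total / trials
```

For example, `mc_one_hits(10, 1, 1.0)` is exactly 1, since a length-1 sequence always contains exactly one one hit.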
In order to compute this expectation we make use of an interesting property of the Zipf distribution. Given a random variable with a Zipf distribution over the integers 1 to m, conditioning on it not being equal to the largest value m yields a random variable with a Zipf distribution over the integers 1 to m-1. It is this stability property that lets us decompose the expected number of one hits by conditioning on the count of the m-th symbol.
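The stability property is easy to verify numerically. A small sketch (the `zipf_pmf` helper is mine, for illustration): drop the largest value from a Zipf pmf, renormalize, and compare against the Zipf pmf over the smaller support.

```python
import numpy as np

def zipf_pmf(m, a):
    # Zipf(a) pmf over 1..m (illustrative helper, not from the paper)
    w = np.arange(1, m + 1) ** (-a)
    return w / w.sum()

m, a = 6, 1.2
p = zipf_pmf(m, a)
# Condition on "not equal to the largest value m": drop the last entry
# and renormalize the remaining probabilities
conditioned = p[:-1] / p[:-1].sum()
# The result is exactly the Zipf(a) pmf over 1..m-1
assert np.allclose(conditioned, zipf_pmf(m - 1, a))
```

The identity is exact: both sides equal k^(-α) divided by the harmonic sum over 1..m-1, since the common 1/H(m, α) factor cancels in the renormalization.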
    μ(m, n) = Σ_{r=0..n} P(c_m = r) · E[O(m, n) | c_m = r]
            = Σ_{r=0..n} P(c_m = r) · (1{r = 1} + μ(m-1, n-r))

In the first line above, we condition the expectation on the count of the m-th symbol. The trick is to recognize that the conditional expectation is taken over exactly the same distribution as the original problem, with one less symbol and r fewer members of the sequence, resulting in the second line. Expressing this in long form,
    μ(m, n) = P(c_m = 0) · μ(m-1, n)
            + P(c_m = 1) · (1 + μ(m-1, n-1))
            + Σ_{r=2..n} P(c_m = r) · μ(m-1, n-r)
    μ(m, 1) = 1
    μ(1, n) = 1{n = 1}

the second line stands out, as we must account for the one additional one hit from the m-th symbol itself. And the last two lines stand out as they form the initial conditions that allow us to bootstrap, in a manner similar to dynamic programming, from the simple cases of small m and n.
The final mathematical detail is to recognize that the probabilities in the breakout above are simply binomial probabilities:

    P(c_m = r) = C(n, r) · p_m^r · (1 - p_m)^(n-r),   with   p_m = m^(-α) / H(m, α)
The expression for the expected number of one hits is compute friendly, since the right-hand side consists only of terms in which m and n have been reduced. A reference implementation, slow but easy to read, is
import numpy as np
from scipy.stats import binom

def Hma(m, a=1):
    # Normalizing constant (generalized harmonic number) for Zipf(a) over 1..m
    h = 0.0
    for k in range(1, m + 1):
        h += 1 / pow(k, a)
    return h

def muOneHits(M, N, a):
    # Expected number of symbols occurring exactly once, when:
    # M symbols in sequence of length N, iid Zipf(a)
    # returns expected value as a function of n = 1, ..., N
    # U[m, n] holds the expectation for m+1 symbols and sequence length n+1
    U = np.zeros((M, N))
    U[:, 0] = 1  # a length-1 sequence always contains exactly one one hit
    for m in range(1, M):
        # probability of the (m+1)-th (least likely) symbol under Zipf(a)
        p = 1 / (Hma(m + 1, a) * pow(m + 1, a))
        for n in range(1, N):
            # condition on the count r of the (m+1)-th symbol in n+1 draws
            for r in range(n + 2):
                c = binom.pmf(r, n + 1, p)
                # with r draws removed, the rest is a Zipf(a) sequence over
                # m symbols of length n+1-r (zero one hits if it is empty)
                rest = U[m - 1, n - r] if r <= n else 0
                U[m, n] += c * (1 + rest) if r == 1 else c * rest
    return U[M - 1]
Using the above code, we can study the variation in the expected number of one hits as a function of the sequence length. For example, the following shows the expected number of one hits for m=10 symbols and sequence lengths ranging from 1 to 300.
As noted in the paper, we see that for smallish sequence lengths the expected number of one hits is relatively high, and as the length of the sequence increases it decreases. This wide range of expected one hits, and most importantly the high values for shorter sequences, motivates the algorithm described in the paper.
The dependency on the Zipf index α is shown below,
where we see that for more uniform distributions (α = 0.8) the peak expected one hits is higher but the decay towards zero is faster. Conversely, the more skewed case (α = 1.2) has a lower peak and trails off more slowly.
The motivation for this mathematical stroll was the algorithm discussed in this interesting paper. The paper is full of analysis beyond the simple calculation of the expected number of one hits. It is a good read if you’re interested in caching.
The analysis presented here provides a bit of detail on the statistics of one hits. It would be interesting to try to derive asymptotic results from the conditional expectation construction, perhaps with a bit of fixed-point analysis and a continuous approximation of the binomial. But that will have to wait for another day.
PolyScale provides automated caching with sophisticated multi-tier cache invalidation out of the box. Sign up for your own free PolyScale account here (no credit card required). Connect your database to a cache and start delivering your data around the globe, faster. Or try out PolyScale without bringing your own database with our live interactive demo playground.