point

Remember me

## Real-time Analytics: Hokusai adds a temporal component to Count-Min Sketch

Sun, 18 Nov 2012 11:40:13 GMT

Introduced in 2003 by Cormode and Muthukrishnan, the Count-Min sketch is a popular and simple algorithm for summarizing1 data streams. In particular it's often used to calculate simple frequencies, identify frequent elements (sometimes referred to as heavy hitters), and compute quantiles (see [1]).

The Count-Min sketch can take advantage of distributed/parallel compute resources. Suppose a data stream has a domain of possible symbols, X = {s1,s2,..., sD}. The Count-Min sketch is comprised of a (d x n) matrix (of counters), M, and hash functions

h1: X -> {0, 1, 2, ..., n-1}
h2: X -> {0, 1, 2, ..., n-1}
...
hd: X -> {0, 1, 2, ..., n-1}

The algorithm for updating and retrieving counts is detailed below:

What makes the Count-Min sketch powerful is that the count estimates it produces comes with decent guarantees. In practice the dimensions of the matrix (of counters) M is tied to parameters (ε, δ)

n = ceiling( e / ε )       , width
d = ceiling( ln(1/δ) )      , depth

If nx is the true count, and cx is the Count-Min sketch estimate, then with probability (1 - δ)

Hokusai2 from Yahoo! Research
The Count-Min sketch provides decent summary statistics for current data. But in many applications, you'll want to be able to answer "historical" queries ("give me the count from noon of the 3rd Sunday in March of last year"). Hadoop/MapReduce can provide exact answers, but not at the latency required by many (near real-time, web) applications.

Techniques that provide approximate answers in real-time would be ideal, but they aren't designed to go back in time to answer "historical" queries. If only the Count-Min sketch kept track of time! Luckily a group of researchers from Yahoo! decided to extend the Count-Min sketch so it can handle queries against (distant) historical data (see [2]). Two observations led them to conclude that one can aggregate sketches on small, non-intersecting time intervals. Given two non-intersecting time intervals, T1 and T2, the Count-Min sketch on their union is:

MT1 ∪ T2 = MT1 + MT2

One can also fold the "... ﬁrst and second half of a Count-Min sketch on itself."

For stream applications that involve large amounts of data, compression becomes important3. The Yahoo! Research team exploited Count-Min data structures, to arrive at different compression strategies:

1. Time Aggregation: One strategy is to perform binary aggregation on the time axis. That is, rather than retaining a 1 minute resolution for 2 years, we only retain a 2m minute resolution for the last 2m minutes. The rationale is that old observations have an exponentially decreasing value and consequently it is sufficient to store them at a concomitantly decreased precision. The basic idea is to keep aggregates Mi for time intervals of length {1,1,2,4,8,...,2m} available.

2. Item Aggregation: An alternative means of aggregating count statistics over time is to retain the full time resolution while sacrificing accuracy. That is, we can shrink the number of bins n over time but we retain the unit time intervals. More specifically, we can halve the resolution each time the Count-Min sketch ages 2n time steps.

3. Resolution Interpolation: Time aggregation yields good estimates over long time intervals, while Item aggregation yields poor estimates over short time intervals. Let n(x) be the item aggregated estimate, n(t) the time aggregated estimate, and n(x,t) the estimate for Item x at Time t. Then the joint distribution n(x,t) can be recovered by assuming the marginal distributions are independent:

The paper closes with tests against a data set of web search queries with close to 100 million unique items. The Time Aggregation and Resolution Interpolation approaches yielded consistently low error rates across the data set (frequency estimates were accurate for frequent and rare items).

References:
1. An Improved Data Stream Summary: The Count-Min Sketch and its Applications by Cormode and Muthukrishnan
2. Hokusai: Sketching Streams in Real Time
3. Count-Min Sketch on Google Sites

(1) The Bloom Filter is a related algorithm that's used to answer "membership queries".
(2) This post was prompted by a recent talk given by Sergiy Matusevych, at the SF Bay Area Machine-learning Meetup.
(3) Performance gains can be huge especially when one is able to compress down to the point where the Count-Min sketch fits into RAM.