
A general introduction to stream processing

Mon, 19 May 2014 10:37:36 GMT

The topic of "real-time computing" (RTC) has grown substantially both in substance and popularity over the past few years alongside the rise of big data and data science. The related topic of data stream mining uses real-time computing to derive information from streams of data. This places additional focus on the end-to-end latency and robustness of the systems, and thus requires streaming algorithms.

What is real-time computing?

"A system is said to be real-time if the total correctness of an operation depends not only upon its logical correctness, but also upon the time in which it is performed." (Wikipedia)

For the sake of clarity, since I'm focusing on the application to data analysis, I consider "real-time computing" to cover computations that are subject to a time constraint.

I am particularly interested in soft real-time systems, where the value of the information being processed decays with time: if we cannot react quickly to the input data, the output response becomes less useful.

The Building Blocks of Stream Processing

Over the next series of blog posts, I will discuss several issues relevant to doing real-time computation (particularly on big data), survey different implementations of real-time computing frameworks, and then conclude by implementing a basic framework myself. The goal is to develop a general abstraction for doing real-time computation, covering a number of recurring topics and design patterns along the way.

Many of these topics are standard tools for any large-scale computing. Underlying many of the systems that we will consider are messaging systems, which are often used to implement near-real-time asynchronous computation. A typical pattern is for messages to be pushed onto a message queue, with downstream systems subscribing to a message broker (e.g. ActiveMQ, RabbitMQ, Kafka). Beyond the messaging system itself, an end-to-end stream processing system must also consider how messages are stored and recovered, and how they are processed once taken off the queue (e.g. with a CEP system).

Complex Event Processing (CEP) is a very popular paradigm for working with data streams. This allows the creation of relatively complex algorithms from basic building blocks such as stream joins, filtering, and aggregation.


I will review some general aspects of a handful of complete open-source frameworks, focusing mostly on Storm and Druid.

There are a handful of open-source CEP engines, and I will focus on a few of these.

Underlying all of these applications is a requirement to understand messaging. I have used several messaging systems in the past and will briefly touch on them.

Following an exploration of these frameworks, I will move on to implementing a simple end-to-end system myself. In that section, I will walk through how to implement an event handler using the callback design pattern, and compare it against Boost.Signals2.

I will conclude by benchmarking the performance of the different implementations.
