Beyond Events: the Stream Abstraction

In an event-driven system, the unit of abstraction is the event. An event is a notification about something happening in the system. Events can be processed and lead to other other events.

In systems relying on streams, the unit of abstraction is the stream. A stream is a sequence of events, finite but also possibly infinite. Rather than processing individual events, you manipulate streams.

If you simply want to react to each event, the difference is insignificant. For more complex situations, using streams makes it easier to express the processing. Streams are manipulated with various operators, like map, flatMap, join, etc. Implementing windowing is for instance a trivial task with streams – just use the right operator – whereas it would be a complicated task using only events.

merge

One main use case for streams is to implement data pipelines. In this case we speak of stream processing. This is what Apache Flink and Kafka Streams are for. Stream processing is typically distributed on several machines to processe large quantities of data. The processing must be fault-tolerant and accomodate the failures of individual processors. This means that such technologies have sophisticated approches to state management. In the case of Kafka Stream, part of the heavy lifting is delegated to Kafka itself. Durable logs enable the system to resume after a failure, if needed reprocessing some data a second time.

Streams can also be used within applications to locally process data. This is what RxJava and Akka Streams are for. This tend to be referred as reactive programming and reactive streams. You use reactive programming to process asynchronous data, for instance video frames that need to be buffered. Rather than using promises or async/await to handle concurrency, you use streams.

There are many similarities between stream processing and reactive programming but also differences. In both cases, we find sources, streams, and sinks for events. In both case, you have issues with flow control. That is, making sure the producers and consumers can work at different paces. Since both use cases differ, the abstractions might differ, though. Streams in reactive programming supports for instance some form or exception handling, similar to regular java exceptions. Exception handling in stream processing is different. With reactive programming, buffering will be in-memory. With stream processing, buffering can be on disk (e.g. using a distributed log).

The stream, as an abstraction, is a relatively young one. It isn’t as well established as, say, relational databases. The terminology varies across products as well as concepts. The difference between stream processing and reactive programming is also not fully understood. For some scenario, the differences are irrelevant. As evidence that the field matures, some efforts to standardize the concepts have already started. The new java.util.Flow package is a standard API for sources (called publisher), streams (called subscription), sinks (called subscriber) in reactive programming. Alone, it doesn’t come with any standardized operator, though. This makes its usefullness at the moment limited to me. Project Reactor‘s aim is similar and it’s an implementation of the reactive streams specification that is embeddable is various framework, e.g. spring. Its integration in spring cloud stream effectively bridges the gap between reactive programming and stream processing.

The stream, as an abstraction, is very simple but very powerful. Precisely because of this, I believes it has the potential to become a core abstraction in computer science. It takes a bit of practice to think in terms of streams, and once you get it, you see possible applications of streams everywhere. So let’s stream everything!

More

Leave a comment Cancel reply