Stream processing is another key component of real-time or streaming data infrastructure.
Zander Matheson is building Bytewax, an open source Python framework to build real-time apps using streaming data.
In this episode, Zander explains stream processing in simple terms, touches upon the difference between stateful and stateless stream processing, and describes some of its common use cases and benefits.
This episode concludes the series on real-time analytics.
Let’s dive in:
Q. In simple terms, what is stream processing?
Stream processing, very generally, is the ability to process a bounded or unbounded stream of data. You process one piece of data at a time, as it arrives, which generally means in real time.
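As a rough illustration of "one piece of data at a time", here is a minimal sketch in plain Python (not Bytewax's API). The source functions and the doubling step are made up for the example; the point is only that the same processing loop works whether or not the stream ever ends.

```python
import itertools

def bounded_source():
    """A stream with a known end, e.g. a file that has been fully written."""
    yield from [3, 1, 4]

def unbounded_source():
    """A stream with no end, e.g. a sensor that keeps emitting readings (simulated here)."""
    for n in itertools.count(start=1):
        yield n

def process(stream):
    """Handle one item at a time, without waiting for the stream to finish."""
    for item in stream:
        yield item * 2

# The bounded stream can be drained completely.
bounded_result = list(process(bounded_source()))

# The unbounded stream never finishes, so we only take the first five results.
first_five = list(itertools.islice(process(unbounded_source()), 5))
```

Because `process` is a generator, it never needs the whole input in memory, which is what makes the unbounded case workable.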
Q. Can you briefly explain what Bytewax does and how it works?
Bytewax is an open source framework — a stateful stream processor that allows you to easily build on top of streaming data.
Think of it as a tool to build applications that leverage streaming data. It could be something advanced like an online machine learning algorithm for anomaly detection or it could be something simple where you are just transforming that data in real time. So Bytewax gives you the pieces to connect to streams of data, manipulate them, and then connect to downstream data systems.
Q. What are the key differences between stateful and stateless stream processing?
One of the hard things about processing streaming data is maintaining a picture of what's going on so that you can do more advanced things with it, and that's basically what stateful processing gives you.
Let's say I want to know all the things that happened for a certain user over a period of time. That's when I'd bring in state, because I want to define a window of time and aggregate data about that user within it, so I have to maintain information about the user over time.
On the other hand, stateless stream processing only lets you act on one piece of data at a time and it very much simplifies the problem because you don't need to know what else is going on and you can scale it more easily.
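To make the distinction concrete, here is a minimal sketch in plain Python rather than Bytewax's own API. The event shape, the function names, and the 60-second tumbling window are all assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical events: (user_id, timestamp_in_seconds, amount)
EVENTS = [
    ("alice", 5, 10.0),
    ("bob", 12, 4.0),
    ("alice", 40, 2.5),
    ("alice", 65, 7.0),
    ("bob", 70, 1.0),
]

def to_cents(event):
    """Stateless step: the output depends only on the event passed in."""
    user, ts, amount = event
    return (user, ts, round(amount * 100))

def windowed_totals(events, window_seconds=60):
    """Stateful step: keeps a running total per (user, window start)."""
    totals = defaultdict(float)  # this dictionary is the state
    for user, ts, amount in events:
        window_start = (ts // window_seconds) * window_seconds
        totals[(user, window_start)] += amount
    return dict(totals)

stateless_out = [to_cents(e) for e in EVENTS]
stateful_out = windowed_totals(EVENTS)
```

The stateless step could run on any worker in any order, since it remembers nothing. The stateful step has to keep `totals` around between events, and in a distributed setting all events for a given user would need to reach the worker that holds that user's state, which is where the extra difficulty comes from.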
Q. How can a tool like Bytewax be used in conjunction with an open source OLAP datastore such as Apache Pinot?
You can use various systems downstream of Bytewax; what you use depends on how other people inside the company are going to interact with the data, or on how you're going to serve it.
Bytewax could be used as a transformation layer in front of an OLAP database if you have a bunch of dashboards running on top of real-time data. With Bytewax, you could do transformations and adjustments to the data, maybe add in third-party data sources, and then write it out to Pinot so that the systems downstream can run queries on that data and produce dashboards.
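The shape of such a transformation layer might look like the following plain-Python sketch. Nothing here is Bytewax's or Pinot's actual API: the field names, the IP-prefix lookup table standing in for a third-party data source, and the in-memory list standing in for the path into the OLAP store are all hypothetical.

```python
# Hypothetical lookup table standing in for a third-party data source.
COUNTRY_BY_IP_PREFIX = {"10.": "internal", "81.": "FR", "52.": "US"}

def transform(event):
    """Normalize field names and units before the data reaches the database."""
    return {
        "user_id": event["uid"].strip().lower(),
        "amount_usd": event["amount_cents"] / 100,
        "ip": event["ip"],
    }

def enrich(row):
    """Add a country column from the (assumed) external lookup."""
    country = "unknown"
    for prefix, code in COUNTRY_BY_IP_PREFIX.items():
        if row["ip"].startswith(prefix):
            country = code
            break
    return {**row, "country": country}

def run_pipeline(source, sink):
    """Process events one at a time and hand each finished row to the sink."""
    for event in source:
        sink.append(enrich(transform(event)))

raw_events = [
    {"uid": " Alice ", "amount_cents": 1250, "ip": "81.2.3.4"},
    {"uid": "BOB", "amount_cents": 300, "ip": "192.168.0.1"},
]
rows_for_olap = []  # stand-in for whatever feeds the OLAP store
run_pipeline(raw_events, rows_for_olap)
```

Doing the cleanup and enrichment before the data lands means the dashboards can query ready-to-use rows instead of repeating the same transformations at query time.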
Q. What are the top use cases of stream processing technologies?
The most common use cases today for stream processing are applications for anomaly detection — things like fraud detection for credit card transactions, or in cybersecurity, detecting when there's some anomalous behavior.
I'm not sure which use case is the single most common, but those are two that involve anomaly detection, which is a pretty good fit for stream processing.
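As a toy illustration of streaming anomaly detection, the sketch below flags a value that sits far from the mean of the last few values seen. It is plain Python, not a production fraud model; the window size, the threshold of three standard deviations, and the sample amounts are assumptions picked for the example.

```python
from collections import deque
from statistics import mean, pstdev

class RollingAnomalyDetector:
    """Flags a value that is far from the mean of the most recent values."""

    def __init__(self, window_size=5, threshold=3.0):
        self.window = deque(maxlen=window_size)  # state: the recent values
        self.threshold = threshold

    def observe(self, value):
        is_anomaly = False
        # Only score once the window is full, so the baseline is meaningful.
        if len(self.window) == self.window.maxlen:
            mu = mean(self.window)
            sigma = pstdev(self.window)
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                is_anomaly = True
        self.window.append(value)
        return is_anomaly

# Hypothetical transaction amounts arriving one at a time.
amounts = [20, 22, 19, 21, 20, 500, 21, 20]
detector = RollingAnomalyDetector()
flags = [detector.observe(a) for a in amounts]
```

Each transaction is scored the moment it arrives, which is the property that matters for something like card fraud. Note that this simple version lets the flagged value into the window afterwards, which inflates the baseline for the next few events; a real system would likely handle that more carefully.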
Q. Can you explain how stream processing can be used in e-commerce to build better shopping experiences?
Stream processing has really great use cases for personalization in e-commerce. Like Amazon, even small e-commerce sites can leverage stream processing technology to offer personalized recommendations to every shopper.
Q. Which industries benefit the most from stream processing tech? Are there industries where stream processing is non-negotiable?
To answer this question it's best to zoom out.
Stream processing adds complexity to your infrastructure and so it's important that there's an ROI associated with the change that you make.
If you want to leverage stream processing, you need to make sure that that increase in complexity is going to be worth it in the end.
The industries best suited to adopting real-time tech are the ones where getting closer to real time meaningfully improves decision-making or the user experience.
In terms of being non-negotiable, there are plenty of instances in IoT where stream processing is a must for connected devices.
You can think of connected cars or situations where you need to make a decision for the user in real time — like Uber and Lyft when they're trying to match the right driver to the right person. Ultimately, people will open Uber and Lyft, request a driver, and go with whichever is faster.
Therefore, for companies like Uber that cannot really exist without real-time technology, it makes sense to increase the complexity of their infrastructure to provide a better experience and ultimately increase revenue.
Q. Last question — what's the one piece of advice you have for companies that are evaluating stream processing or other real-time technologies?
Coming back to what we were just talking about, I'd double down on that.
Can you increase the revenue generated by the product or decrease your costs? Can you affect the company's bottom line with the adoption of real-time tech?
That's basically it, because adopting real-time tech increases the complexity of the code that you write and the systems you maintain.