Many streaming app use cases help businesses first understand what is going on and then provide insight into what to do about it.
RTInsights recently sat down with Chris Sachs, Founder and CTO at Swim, to discuss why streaming data is so important to businesses today. During our conversation, we explored how streaming applications differ from what are traditionally considered data-in-motion apps. We also delved into the challenges of making use of streaming data, the benefits of doing so, and common use cases for streaming apps.
Here is a summary of our conversation.
RTInsights: What is streaming data, and why is it so important to businesses today?
Sachs: I like to frame the overall goal of streaming data as follows: businesses are looking to use streaming data to understand what’s happening right now, to classify what the current situation means for the business, and to decide what to do about it. These are the overarching objectives of most of the streaming data use cases we encounter.
For the business, the streaming data essentially offers a picture of what’s happening now, or more specifically, what’s changing now. But before that picture can be acted upon, meaning needs to be assigned to it. Is this good or bad? Is it expected or unexpected? And if it’s bad, what can be done to improve the situation? What can be done to achieve more optimal, productive, or cost-effective operations? What can be done to improve user experience?
Ideally, the endgame for much of the digital transformation around streaming data is to drive large-scale business automation and to automate moment-by-moment decisions in furtherance of the business objectives. The streaming data is being used to understand the current state and enable that automation.
In terms of the data itself, since it’s offering insight about what’s happening in the world, it can be any type of information or event, including sensor data, clickstream data, sentiment analysis, geo-location, text message, etc. It’s some stream of information about what’s happening and what’s changing.
Big data, which came before, solved the problem of how to do the coarse steering of the enterprise on a quarter-by-quarter basis. Big data is very effective at directing or steering these large enterprise ships. What big data is not great at is the micro-adjustments, the small adjustments on an hour-by-hour, minute-by-minute, or even second-by-second basis.
That’s where streaming data enters the equation. Big data does the coarse steering. The goal of streaming data is to provide that fine steering, which is that next level of optimization and automation that businesses are looking for.
RTInsights: How do businesses use streaming data?
Sachs: There are three broad categories of how streaming data is used today: event processing, streaming analytics, and observability. Let’s take each in turn.
Event processing is really the “T” in ETL. It’s the transformation. The primary way that streaming data is used today is that it gets extracted from some source, some set of devices, websites, or apps. It is then transformed with event processing and loaded into a database. Ironically, streaming data isn’t streaming for very long.
Most streaming data only streams one hop over the network at a time. These are short-lived streams, and they are often very large with many, many events. But the data is streamed one hop, perhaps transformed or analyzed a little bit on the fly, and then written to a database. As a result, the overwhelming majority of applications and data processing happens to that data in the database when the data is no longer streaming.
Event processing is big. It is transforming, maybe normalizing, or pre-processing data for machine learning. You might have devices from different vendors that publish data in different formats. Event processing is useful for essentially the equivalent of converting Fahrenheit to Celsius, that kind of thing. You sort of normalize your data with event processors.
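The normalization step described above can be sketched in a few lines. This is only an illustration, not any particular product's API: the field names `temp_f`, `temp_c`, and `id`, and the `Reading` type, are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    """Common shape that every vendor's event is converted into."""
    device_id: str
    celsius: float


def normalize(event: dict) -> Reading:
    """Convert a vendor-specific event into the common Reading shape."""
    # Hypothetical formats: vendor A reports "temp_f", vendor B "temp_c".
    if "temp_f" in event:
        celsius = (event["temp_f"] - 32.0) * 5.0 / 9.0
    elif "temp_c" in event:
        celsius = float(event["temp_c"])
    else:
        raise ValueError(f"unrecognized event format: {event!r}")
    return Reading(device_id=event["id"], celsius=round(celsius, 2))


raw_events = [
    {"id": "a-1", "temp_f": 212.0},  # vendor A, Fahrenheit
    {"id": "b-7", "temp_c": 21.5},   # vendor B, Celsius
]
normalized = [normalize(e) for e in raw_events]
```

Each event is handled on its own, with no reference to any other event, which is what makes this kind of transformation easy to run on the fly.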
Streaming analytics is the second most common way that businesses use streaming data. And it’s very different from big data analytics, where the name of the game is MapReduce. That was Hadoop’s big thing. MapReduce really just means transforming and reducing to take many data points and turn them into fewer data points.
Streaming analytics really struggles with the reduce part. Streaming analytics can do the map part, that’s the transform, the T in ETL, and the event processing. Where streaming analytics is helpful is for detecting trends in a whole stream of data. Is this our average? Are our delivery times or click-through rates trending up or down? What’s the statistical distribution of how long users view a webpage?
Streaming analytics applications tend to be these moving window analytics. You look at some mini-batch or some little window of time, maybe the last five minutes, and you do some analyses five minutes at a time, so that’s streaming analytics.
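A moving-window calculation of the kind described above might look like the following sketch. It assumes events arrive in timestamp order and uses a five-minute (300-second) window purely as an example. The `MovingAverage` class is hypothetical, not taken from any streaming engine.

```python
from collections import deque


class MovingAverage:
    """Average of the values seen in the last `window_seconds`."""

    def __init__(self, window_seconds: float) -> None:
        if window_seconds <= 0:
            raise ValueError("window_seconds must be positive")
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, value) pairs, oldest first
        self._total = 0.0

    def add(self, timestamp: float, value: float) -> float:
        """Record one event and return the average over the current window."""
        self._events.append((timestamp, value))
        self._total += value
        # Evict events that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self._events[0][0] <= cutoff:
            _, old_value = self._events.popleft()
            self._total -= old_value
        return self._total / len(self._events)


avg = MovingAverage(window_seconds=300)
first = avg.add(0, 10.0)     # window holds [10]
second = avg.add(100, 20.0)  # window holds [10, 20]
third = avg.add(310, 30.0)   # the event at t=0 has expired; window holds [20, 30]
```

Only the events inside the window are kept in memory, so the cost per event stays small even on a long-running stream.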
Observability metrics is another key use for streaming data. There’s a lot of talk in the industry about observability. Essentially the focus is on extracting metrics. What’s the memory use of a server? What’s the speed of a connected vehicle? What’s the data throughput of a router? You extract such metrics, which then get inserted into a database to visualize or give some visibility into these metrics.
The actual observability of those metrics is not streaming. You extract the metrics from the stream, they get written to a database, and then nothing happens with them until a user queries them. So, I think observability is almost a misnomer here because we’re still dependent on a human or some other system to query the data to act on it.
RTInsights: What challenges do businesses face when they try to make use of streaming data?
Sachs: There’s a gap between how businesses use streaming data today and what they want to do with it. So, the challenge businesses face is overcoming this gap.
The first challenge is interpreting data in the context of a business and what the business cares about. The business wants to understand what’s going on, and they want to classify it. Is it good or bad? They also want to know how to take action on it.
So, a metric might say this is how many packets were transmitted by a router or how fast a driver is driving. To make sense of that data, you need to put it in context. Saying, “A bus is driving 60 miles per hour.” Is that good or bad? The answer depends. If it’s in a school zone, it’s bad. If it’s on the freeway, it’s okay.
What you find in almost every case with streaming data is that to make sense of it, assign a value to it, interpret it as good or bad, or infer an action to take, you must put that data in context. And you need to put it in context quickly because there’s a very short half-life to streaming data. The value of that data drops off quickly.
For example, suppose a customer enters a store in a shopping mall. You might get a streaming data event from a device sensor that detects the user’s phone, and you may want to send a notification to that customer with an offer. Similarly, a customer might be walking by a store, and you might want to send them a notice to entice them to come in. It’s not helpful to send somebody an offer an hour later when they’ve already left and gone home. There are various ways of getting streaming data that represents that event, but it’s only useful for so long.
Let’s take another contextual example. The streaming data might be the MAC address of a user’s device. Knowing that a device with some random string of numbers walked by is only useful if you can put it in context. And the context might be: whose device is that? Is it an important customer? How often do they come by? Are they a big spender? All of this extra context.
Putting data in context is essential to understanding what’s going on and then figuring out what to do with the information, such as sending an automated offer. Streaming data really struggles with context because, as I mentioned before, the way streaming data is used is as a sort of event processing. And event processing takes one event at a time and analyzes or transforms that event.
You’re often talking about millions of events per second. So, these are very high-rate data feeds. The traditional approach is when you get a message, if you want to put it in context, you have to query a whole bunch of databases.
For example, let’s say you have a million events per second and need ten pieces of context to make sense of the data. Now that’s 10 million queries per second. That’s not going to scale. Or it’s going to be extremely expensive to scale. As a result, businesses struggle with the cost and scalability of putting events in context. And this is why most streaming data just gets written to a database and is still analyzed in batches. It is because it’s so expensive to put streaming data in context.
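The arithmetic above can be made concrete with a small sketch. The `CountingStore` class is a hypothetical stand-in for a context database that simply counts the lookups it serves. The store names and the `device` key are likewise assumptions for illustration.

```python
class CountingStore:
    """Stand-in for a context database that counts the lookups it serves."""

    def __init__(self, records):
        self.records = records
        self.queries = 0

    def get(self, key):
        self.queries += 1
        return self.records.get(key)


def enrich(event, stores):
    """Attach one piece of context per store: one query per store, per event."""
    enriched = dict(event)
    for name, store in stores.items():
        enriched[name] = store.get(event["device"])
    return enriched


# Ten hypothetical context sources, each keyed by device id.
stores = {f"context_{i}": CountingStore({"dev-1": i}) for i in range(10)}
events = [{"device": "dev-1", "seq": n} for n in range(1_000)]

enriched = [enrich(e, stores) for e in events]
total_queries = sum(s.queries for s in stores.values())  # 1,000 events x 10 stores

# The same multiplication at the rate quoted in the interview.
events_per_second = 1_000_000
queries_per_second = events_per_second * len(stores)
```

The query load grows as the product of event rate and pieces of context, which is why per-event lookups become costly at high rates.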
Once you’ve enriched your streaming data with context, the second piece is to run business logic against it. This is where the rules and the domain knowledge of your business, such as when to make an offer to a mall shopper, come in. That’s business logic, and it’s different than analytics.
You can think of analytics as looking down at the data and understanding the mathematical properties of data, which is very useful. But whether or not to make an offer to a customer is not an analytics question. There may be analytics that feed into it, but it’s ultimately the business logic. And it’s that logic that requires an understanding of the goal, rules, and objectives of the business.
Running business logic against streaming data is another major challenge that businesses face. The challenge is that analytics systems are extremely restrictive in the kind of computations they can run. They essentially can run what are called pure associative functions, which means the computations have to satisfy very strict mathematical properties.
Most business logic does not conform to the mathematical requirements of analytics engines, so running business logic is expensive. This, again, is why most streaming data gets written to a database and processed (factoring in the business logic) later.
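One way to read the “pure associative functions” restriction is shown in the sketch below. It is an interpretation, not a description of any specific analytics engine: `merge` combines partial (count, total) aggregates, and because it is pure and associative, partial results computed separately can be combined in any grouping.

```python
from functools import reduce


def merge(a, b):
    """Combine two partial (count, total) aggregates. No side effects."""
    return (a[0] + b[0], a[1] + b[1])


# Partial results computed independently, e.g. on three partitions of a stream.
p1, p2, p3 = (2, 30), (3, 45), (1, 5)

# Associativity: the grouping of the merges does not change the result.
left_first = merge(merge(p1, p2), p3)
right_first = merge(p1, merge(p2, p3))

# The final mean is derived once from the merged aggregate.
count, total = reduce(merge, [p1, p2, p3])
mean = total / count
```

A rule such as “send an offer unless this customer already received one today” carries state and depends on the order of events, so it does not decompose into independent partial results as naturally as a sum or a count does.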
As a result, businesses that want to drive automation from streaming data are unable to do so. Or they’re very limited in their ability to do so because they can’t get the context that they need to drive those automated decisions. And they can’t run the business logic they need to actually take those automated actions.
Furthermore, they can’t visualize the streaming data. So, there’s little oversight, true observability, or actual visibility to see what’s happening in these extremely complex dynamic systems. If you’re a large telecom provider with a cellular network, it’s quite difficult to see what’s happening in the network. You have gratuitous amounts of data, but there’s really no visibility of what’s happening now. Having that visibility and oversight is especially difficult.
RTInsights: There are things out there like Kafka and Confluent. How is what you are talking about different, and how do these things work together?
Sachs: The way to think about Kafka and Confluent is as a pipeline for data. One of our customers described streaming data as “oil spurting out of the ground” in their backyard. They know it’s valuable, they’re drowning in the spurting oil, and they’re not entirely sure what to do with it. They have a vague sense that they’d like to measure it. That’s the sort of metric they are interested in. They’d also like to distribute and act on it.
With spurting data, Kafka gives you pipes. You can pipe it here or there or someplace else. That’s really important. It’s essential to be able to take that streaming data and move it. The oil is spurting out of the ground here, and you need it in this data center, or this edge compute location over there. And hence the name “Data-in-motion.” And Confluent will help you move that data.
Keeping the same analogy, when you put Kafka in place, the oil’s now just dumping onto the ground somewhere else. So, you succeeded in moving the data, but it’s still spilling out again somewhere else. The question is, where should that data go? What’s the ultimate destination for that data? That’s sort of where Confluent stops. They’ll give you a pipe, and then it’s up to you to get a bucket to catch all that data squirting out of the pipe and then figure out what to do with it. So, you catch it in a database. In summary, the current state of the world is that data comes out of the ground, gets moved, and then gets pumped into a database, which is a type of oil depot.
That’s what data in motion is. It’s the pipes. But businesses want to do more than just move the data. Moving data doesn’t answer what’s going on or what to do about it. So, most of the value derived from streaming data happens after that fuel is in the depot. By the time it gets through that oil refinery, a lot of time has passed, and a lot of cost has accumulated. The state today is that applications, business logic, and automation do not run on the streaming data. They run on the data at rest.
So, you have data in motion for the short jog from where it’s generated into a depot, and then it becomes data at rest again. Most of the tooling in the industry still runs on that data at rest, not on data in motion. What’s needed is the ability to put data in context. That’s done in the application layer.
Data pipelines are built in layers, where you typically have a data layer, maybe an analytics layer, and then an application layer. The application layer is where the business logic runs, where the automation happens, and where the knowledge of the business is. And right now, the analytics layer partially works with streaming data. The application layer essentially does not work at all with streaming data.
To meet today’s business needs, the application layer has to be modernized to work with this new streaming data. Why? The application layer as it exists today is relatively stagnant. Application services haven’t really changed for the better part of 20 years. They’re still designed for the database-driven web applications of the early 2000s. They weren’t designed for streaming data, and they’re unsuitable for streaming data in many ways.
Bringing the application platforms into the mix closes the gap between how businesses currently use the streaming data and how they want to use and apply it to the business. This has to be done in such a way so that their applications can run directly on streaming data instead of being tied to the oil depot.
RTInsights: Can you give some examples or use cases of that?
Sachs: As I said before, many use cases are going to fall into the categories of understanding what the heck is going on and what to do about it. The first step that we see most streaming data users take is that they build a real-time model of their whole business. And this often comes down to customers creating digital twins of the business entities they care about.
These digital twins are a somewhat different notion from the “digital twin” term that’s been around for a long time. Originally, digital twins arose in a manufacturing environment. We’re talking about a much more expansive definition of digital twins here. Businesses want to create digital twins of their vehicles, delivery routes, customers, infrastructure, and more; essentially every digital and physical asset. And they want to feed those digital twins with streaming data, so they have a live picture of everything that’s happening in their business and how it’s related.
Such capabilities answer the question I continue to bring up, which is, “What the heck is going on?” You could create a real-time picture of a network or food delivery service based on the business needs. An online shopping company like Instacart wants a real-time inventory of all the stores in an area. It wants digital twins of all the drivers, buyers, and shoppers. So, you have this detailed model of where everything is, what’s in stock, who’s on time, who’s late, who’s ordering what, what’s in whose shopping basket, and what the replacements are. You create this real-time model of the whole world. And that again answers the question of what’s going on.
Now you really have observability. You have much more meaningful observability than just metrics like the CPU use in a server. You’ve got true visibility of what’s going on in your whole enterprise. These digital twins capture that all-important context. Now you have that picture, and it’s immediately usable. It’s usable by humans. If there’s a failure or somebody calls to complain about a late order, at least humans have access to this live picture of what’s going on.
Sometimes it’s referred to as building a large state machine of the enterprise. It gives you an understanding of the state of every customer, their experience, and how it’s changing. The next step is what do you do about it? Now you have this picture of what’s going on, and this is where the really interesting use cases come from.
Once you have that picture, you can start running business logic in these digital twins to start making decisions. For example, you can send offers to passing-by shoppers. Another use case is to make dynamic routing decisions. If you’re an Instacart-type company and a shopper is driving to a store, your model might notice that the store they’re driving to just ran out of stock of some of the key items the customer’s looking for. You might make the decision to redirect that driver to a different store.
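The rerouting decision described above could be sketched as follows. This is a toy model under stated assumptions, not Swim's API or any vendor's implementation: `StoreTwin`, `ShopperTwin`, and their methods are hypothetical names, and “in stock” is simplified to a quantity greater than zero.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StoreTwin:
    """Live model of one store, kept current by inventory events."""
    name: str
    stock: Dict[str, int] = field(default_factory=dict)

    def on_inventory_event(self, item: str, quantity: int) -> None:
        self.stock[item] = quantity

    def has_all(self, items: List[str]) -> bool:
        return all(self.stock.get(item, 0) > 0 for item in items)


@dataclass
class ShopperTwin:
    """Live model of one shopper who is heading to a store with an order."""
    name: str
    order: List[str]
    destination: StoreTwin

    def reroute_if_needed(self, candidates: List[StoreTwin]) -> bool:
        """Business logic: switch stores if the destination cannot fill the order."""
        if self.destination.has_all(self.order):
            return False
        for store in candidates:
            if store.has_all(self.order):
                self.destination = store
                return True
        return False  # no candidate can fill the order; keep the destination


north = StoreTwin("north", {"milk": 3, "eggs": 12})
south = StoreTwin("south", {"milk": 5, "eggs": 6})
shopper = ShopperTwin("sam", ["milk", "eggs"], destination=north)

north.on_inventory_event("milk", 0)  # streaming event: north runs out of milk
rerouted = shopper.reroute_if_needed([north, south])
```

The context the decision needs (current stock at each store) already lives in the twins, so the rule runs without querying a separate database for every event.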
If you’re a telecom provider, you might have an outage or a network failure, and you want to do some automated root cause analysis. Today, root cause analysis can be very time-consuming and expensive to perform. And it needs a lot of context to understand why something failed. Again, we’re back to this issue of context. You might do some root cause analysis so that you can then automate the remediation of that failure, such as rolling back the configuration of a device.
Another use case is for automating coordination. If you’re a company like Walmart that provides free overnight shipping of packages to stores, you may want to be able to notify a customer when they can leave their home to drive to the Walmart store to pick up their package. In this case, you are automating notifications about a future event. Again, that requires lots of context.
Another common use case revolves around the idea of experience scoring, which is at the heart of the “Customer 360” idea that Confluent talks about. The idea is you want to model every customer and know everything about that customer so that you can understand their customer experience. You want to be able to notice, in real time, that the experience starts to go bad. Knowing this, you can intercede with a message apologizing, avoid problems before they crop up to ensure a great customer experience, reduce customer churn in competitive markets, and manage other customer-related experiences. Those are only a few use cases, but the reality is use cases are as simple, complex, or creative as each company needs or requires, and that’s what’s so exciting about streaming data applications. While the essential ingredients of the streaming data application “recipe” are there, each use case “meal” can be tailored to the business’s unique needs.
RTInsights: Many users are paralyzed on how to even start this stuff. Any advice on how to get started?
Sachs: It can be a bit overwhelming. This streaming data transformation is happening rather quickly. Time scales just continue to compress. We maybe had ten years to get used to using streaming data. But the world is so competitive and fast-moving. Just in the last five years, the number of deployments and the amount of streaming data in the world have grown exponentially and will continue to grow.
If your competitors are building AI and automation systems to optimize their operations and drive compelling real-time experiences, they’re going to win. Or they are going to have a major leg up. I think an arms race is shaping up to adopt and make use of streaming data. All while functioning in a very different world. With big data, it was databases, but bigger. With streaming data, the whole flow of information is being turned on its head.
It can be overwhelming to have that oil squirting out of the ground and try to figure out what to do with it. The way to tackle it is to start with that end state. This is a piece that gets missed. There’s a lot of focus on the infrastructure to deploy or the kinds of analytics to run. Those decisions are not always rooted in the endgame business objectives.
The first step is to think as a business thinks. What do you want to understand about what’s going on? What do you want to model? What do you want to characterize or classify? What are the things that you want to automate? Do you want to automate the understanding of the customer experience and how to improve it?
Identify that end state and work backward from there. Taking the Walmart example I mentioned, the end state may be that you want to project very precisely whether packages are going to be delivered late so that you can notify customers not to head for the store. You then need to identify the input data sources and context you need to make that decision. And what are the business entities you need to model?
Similar to object-relational mapping, when you’re building a database, you want to model your business but with an eye toward what’s happening now. Consider looking at it from a different perspective of where customers are now versus where the packages are. The first action you’re going to take in terms of understanding what’s happening and what to do about it is to create models. You’re going to create digital twins.
Think about what business entities you want to create digital twins of and what information and context you can gather that might relate to these digital twins to try and understand the truth. If you can understand the truth of what’s happening in your business, you’re halfway there. And just by building the most comprehensive, precise single source of truth, a lot of value’s going to fall out of that. You’ll get that picture of what’s going on. While this may seem complex, taking enough time upfront to understand the information and context required will save time, money, resources, and heartburn as you develop applications.
Once you have that source of truth, then you think of it like a state machine. Okay, this driver’s late, this package is on time, or this customer is happy. These are states. This customer is a big spender and is at risk of leaving for a competitor. Those are all the states in this big state machine of your business.
Then you want to think, what are the actions to take when one of these states changes? So, what’s the action you want to take when a customer goes from not at risk of leaving for a competitor to high risk of leaving or low risk of leaving? What other data sets and context do you need to make that decision?
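The idea of attaching actions to state changes can be sketched as a small state machine. Everything named here is hypothetical: the two `Risk` states, the `CustomerTwin` class, and the retention-offer action are assumptions chosen to mirror the churn example above.

```python
from enum import Enum


class Risk(Enum):
    LOW = "low"
    HIGH = "high"


class CustomerTwin:
    """Holds one customer's current state and fires actions on state changes."""

    def __init__(self, customer_id):
        self.customer_id = customer_id
        self.risk = Risk.LOW
        self._actions = {}  # (old_state, new_state) -> list of callables

    def on_transition(self, old, new, action):
        """Register an action to run when the state moves from old to new."""
        self._actions.setdefault((old, new), []).append(action)

    def set_risk(self, new):
        old = self.risk
        if new is old:
            return  # no state change, so no action
        self.risk = new
        for action in self._actions.get((old, new), []):
            action(self)


sent = []
customer = CustomerTwin("c-42")
customer.on_transition(
    Risk.LOW, Risk.HIGH,
    lambda c: sent.append(f"retention offer -> {c.customer_id}"),
)

customer.set_risk(Risk.LOW)   # unchanged: nothing fires
customer.set_risk(Risk.HIGH)  # LOW -> HIGH: the action fires once
customer.set_risk(Risk.HIGH)  # unchanged again: nothing fires
```

Actions are keyed on the transition rather than on the state itself, so a customer who stays at high risk does not receive the same offer on every incoming event.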
Start small and start building. Just go straight for automation. You don’t need to spend five years deploying infrastructure. Build an end-to-end use case quickly. And then repeat that for multiple use cases instead of embarking on a large sort of project to deploy a bunch of infrastructure and hope you get the result you want.