Stream computing is often the outlier in discussions about big data architectures—but it shouldn’t be.
The core role of stream computing is to deliver extremely low-latency processing of data in motion, and it doesn’t rely on high-volume storage to do its job. By contrast, the big data platforms that often gain the most mindshare (the massively parallel processing architectures underlying enterprise data warehouses, Hadoop, and other analytics databases) usually require high-volume storage. This storage can have a considerable physical footprint within the data center and is therefore generally more visible than a stream computing architecture, which might be distributed across smaller servers in many data centers.
Clearly, a balanced big data architecture—one that enables maximum velocity, volume, and variety—needs stream computing to supplement and integrate with other approaches. From an architectural standpoint, a comprehensive big data platform provides a latency-agile resource that persists, aggregates, and processes any dynamic mix of at-rest and in-motion information. It’s best to think of this comprehensive big data fabric as consisting of multiple Fit for Purpose platforms that incorporate specialized “data persistence” architectures for both short-latency persistence (caching) of in-motion data (stream computing) and long-latency persistence (storage) of at-rest data (from the enterprise data warehouse, Hadoop, and so on). Each Fit for Purpose persistence platform can be optimized to execute the various analytic models, workloads, and jobs associated with the type of data it is designed to handle.
The practical distinctions between these Fit for Purpose big data platforms are blurring. Stream computing architectures increasingly process many of the same types of analytics that you might also execute on your Hadoop or enterprise data warehouse (EDW) platforms. In addition, stream computing platforms supplement the out-of-the-box multi-latency capabilities of EDW, Hadoop, and other big data platforms. For example, all of IBM’s core big data platforms, namely InfoSphere Streams (stream computing), InfoSphere BigInsights (Hadoop), and IBM PureData (EDW), can execute MapReduce models for advanced analytics. InfoSphere Streams in particular rapidly ingests, analyzes, and correlates information as it arrives from real-time sources.
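To make the idea of analyzing data as it arrives a little more concrete, here is a minimal sketch in plain Python of a time-based sliding-window average. It is only an illustration of the general stream computing pattern: it is not the InfoSphere Streams API, and the function name `sliding_window_mean` and the `(timestamp, value)` event shape are assumptions made up for this example. Note how the only state kept is a small in-memory window (short-latency persistence), with nothing written to long-term storage.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Tuple

Event = Tuple[float, float]  # (timestamp in seconds, measured value); hypothetical shape


def sliding_window_mean(events: Iterable[Event],
                        window_seconds: float) -> Iterator[Event]:
    """For each arriving event, yield (timestamp, mean of the values seen
    within the last `window_seconds`, including the current event)."""
    window: Deque[Event] = deque()  # the only state: a small in-memory cache
    total = 0.0
    for ts, value in events:
        window.append((ts, value))
        total += value
        # Evict events that have aged out of the window.
        while window[0][0] <= ts - window_seconds:
            _, old_value = window.popleft()
            total -= old_value
        yield ts, total / len(window)


# Usage: four readings, with a three-second window.
readings = [(0.0, 10.0), (1.0, 20.0), (2.0, 30.0), (5.0, 40.0)]
results = list(sliding_window_mean(readings, window_seconds=3.0))
# results == [(0.0, 10.0), (1.0, 15.0), (2.0, 20.0), (5.0, 40.0)]
```

Because the function is a generator, each result is available as soon as its event arrives, rather than after a batch has been stored and scanned. A production stream engine would add distribution, fault tolerance, and out-of-order handling that this sketch leaves out.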
In the recently released Streams 3.0, IBM has taken its stream computing functionality to the next level of scale and sophistication. Whether deployed on its own or in conjunction with other platforms, Streams now:
In addition to all these new features, Streams supports several optional IBM solution accelerators for custom development of key real-time big data applications. Accelerators supported in this release include the Time Series Accelerator, Geospatial Accelerator, IBM Accelerator for Telecommunications Event Data Analytics, IBM Accelerator for Social Data Analytics, and IBM Accelerator for Machine Data Analytics. Streams 3.0 also comes standard with several toolkits (Financial, Mining, Complex Event Processing, and Advanced Text Analytics) that help shorten time to value on low-latency big data projects in any of these areas.
In many ways, stream computing (as implemented in IBM InfoSphere Streams) is a full-fledged, enterprise-grade runtime engine and development platform for the vast range of real-time big data applications. But stream computing can also be deployed outside of big data environments as a low-latency data integration technology for operational business intelligence, business event monitoring, and other applications that don’t require large volumes or wide varieties of data types.
In other words, stream computing can also play a central role in otherwise “small data” applications.
What do you think? Let me know in the comments.