In spite of what you may have heard, Hadoop is not the sum total of big data.
Another big data “H”—hybrid—is becoming dominant, and Hadoop is an important (but not all-encompassing) component of it. In the larger evolutionary perspective, big data is evolving into a hybridized paradigm under which Hadoop, massively parallel processing (MPP) enterprise data warehouses (EDW), in-memory columnar, stream computing, NoSQL, document databases, and other approaches support extreme analytics in the cloud.
Hybrid architectures address the heterogeneous reality of big data environments and respond to the need to incorporate both established and new analytic database approaches into a common architecture. The fundamental principle of hybrid architectures is that each constituent big data platform is fit-for-purpose to the role for which it’s best suited. These big data deployment roles may include any or all of the following:
In any role, a fit-for-purpose big data platform often supports specific data sources, workloads, applications, and users.
Hybrid is the future of big data because users increasingly realize that no single type of analytic platform is always best for all requirements. Also, platform churn—plus the heterogeneity it usually produces—will make hybrid architectures more common in big data deployments. The inexorable trend is toward hybrid environments that address the following enterprise big data imperatives:
Hybrid architectures are already widespread in real-world big data deployments. The most typical are three-tier (also called "hub-and-spoke") environments. These may have, for example, Hadoop (e.g., IBM InfoSphere BigInsights) in the data acquisition, collection, staging, preprocessing, and transformation layer; relational MPP EDWs (e.g., IBM PureData System for Analytics) in the hub/governance layer; and in-memory databases (e.g., IBM Cognos TM1) in the access and interaction layer.
The complexity of hybrid architectures depends on the range of sources, workloads, and applications you're trying to support. In the back-end staging tier, you might need different preprocessing clusters for each of the disparate sources: structured, semi-structured, and unstructured. In the hub tier, you may need disparate clusters configured with different underlying data platforms (RDBMS, stream computing, HDFS, HBase, Cassandra, NoSQL, and so on) and corresponding metadata, governance, and in-database execution components. And in the front-end access tier, you might require various combinations of in-memory, columnar, OLAP, dimensional, and other database technologies to deliver the requisite performance on diverse analytic applications, ranging from operational BI to advanced analytics and complex event processing.
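To make the tiering concrete, here is a minimal sketch of how data might flow through such a three-tier pipeline. The class and method names are purely illustrative assumptions for this sketch, not the API of any product mentioned above.

```python
# Hypothetical sketch of the three-tier ("hub-and-spoke") flow described above.
# StagingTier, HubTier, and AccessTier are illustrative stand-ins for a
# Hadoop-style staging cluster, an MPP EDW hub, and an in-memory access layer.

from dataclasses import dataclass, field

@dataclass
class StagingTier:
    """Acquisition/preprocessing layer: normalizes raw records on ingest."""
    raw: list = field(default_factory=list)

    def ingest(self, record: dict) -> dict:
        # Normalize keys during staging so downstream tiers see a uniform schema.
        cleaned = {k.strip().lower(): v for k, v in record.items()}
        self.raw.append(cleaned)
        return cleaned

@dataclass
class HubTier:
    """Governance/EDW layer: the system of record for conformed data."""
    table: dict = field(default_factory=dict)

    def load(self, record: dict) -> None:
        self.table[record["id"]] = record

@dataclass
class AccessTier:
    """In-memory access layer serving BI queries from a fast cache."""
    cache: dict = field(default_factory=dict)

    def refresh(self, hub: HubTier) -> None:
        # Periodically sync the cache from the governed hub copy.
        self.cache = dict(hub.table)

    def query(self, record_id: int) -> dict:
        return self.cache[record_id]

staging, hub, access = StagingTier(), HubTier(), AccessTier()
hub.load(staging.ingest({" ID ": 1, " Region ": "EMEA"}))
access.refresh(hub)
print(access.query(1))  # {'id': 1, 'region': 'EMEA'}
```

The point of the sketch is the fit-for-purpose division of labor: each tier does only the job it is best suited for, and data moves through well-defined handoffs between them.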
Ensuring that hybrid big data architectures stay cost-effective demands the following multipronged approach to optimization of distributed storage:
Yes, more storage tiers can easily mean more tears. The complexities, costs, and headaches of these multi-tier hybridized architectures will drive you toward greater consolidation, where it’s feasible.
But it may not be as feasible as you wish.
The hybrid big data environment will continue the long-term trend away from centralized and hub-and-spoke topologies toward cloud-oriented and federated architectures. The hybrid platform is evolving away from a single master schema and toward database virtualization behind a semantic abstraction layer. Under this new paradigm, the hybrid big data environment will require virtualized access to the disparate schemas of the relational, dimensional, and other constituent DBMSs and repositories that together form a logically unified, cloud-oriented resource.
Our best hope is that the abstraction/virtualization layer of the hybrid environment will reduce tears, even as tiers proliferate. If it can provide your big data professionals with logically unified access, modeling, deployment, optimization, and management of this heterogeneous resource, wouldn’t you go for it?
The architectural centerpiece of this new hybridized landscape must be a standard query-virtualization or abstraction layer that supports transparent SQL access to any and all back-end platforms. SQL will continue to be the lingua franca for all analytics and transactional database applications. Consequently, big data solution providers absolutely must allow SQL developers to transparently tap into the full range of big data platforms, current and future, without modifying their code.
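As a thought experiment, such a query-virtualization layer can be sketched as a single SQL entry point that routes each statement to a fit-for-purpose back end. Everything here is a hypothetical assumption for illustration: the engine functions, the registry, and the naive FROM-clause routing. A production layer would parse SQL properly, push down predicates, and federate results across platforms.

```python
# Illustrative sketch of a query-virtualization layer: applications speak
# plain SQL to one entry point, and the layer routes each statement to the
# back-end platform that owns the referenced table. All names are hypothetical.

def hadoop_engine(sql: str) -> str:
    # Stand-in for a SQL-on-Hadoop executor; returns a tagged echo for the demo.
    return f"hadoop[{sql}]"

def edw_engine(sql: str) -> str:
    # Stand-in for an MPP EDW executor.
    return f"edw[{sql}]"

class QueryVirtualizer:
    def __init__(self):
        self.routes = {}  # table name -> engine callable

    def register(self, table: str, engine) -> None:
        self.routes[table] = engine

    def execute(self, sql: str):
        # Naive routing: dispatch on the table named in the FROM clause.
        # The SQL text passes through unchanged, so applications need no
        # per-platform code.
        table = sql.lower().split(" from ")[1].split()[0]
        return self.routes[table](sql)

vq = QueryVirtualizer()
vq.register("clickstream", hadoop_engine)   # high-volume events live in Hadoop
vq.register("customers", edw_engine)        # conformed dimensions live in the EDW
print(vq.execute("SELECT * FROM clickstream WHERE ts > 0"))
print(vq.execute("SELECT name FROM customers"))
```

The design choice worth noting is that the routing table, not the application, knows which platform holds which data, which is exactly the transparency the paragraph above argues SQL developers need.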
Unfortunately, the big data industry still lacks a consensus query-virtualization approach. Today's big data developers must wrangle with a plethora of SQL-like languages for big data access, query, and manipulation, including HiveQL, CassandraQL, JAQL, Sqoop, SPARQL, Shark, and DrQL. Many, but not all, of these are associated with a specific type of big data platform, most often Hadoop. I'm including IBM BigSQL (currently in Technology Preview) in this list of industry initiatives.
The fact that we refer to many of these initiatives as "SQL-on-Hadoop" is a danger sign. As an industry, we need to go a step beyond this idea. The big data arena threatens to splinter into diverse, siloed platforms unless we establish SQL as a true lingua franca across all of them.
Siloed query languages and frameworks threaten to ramp up the cost, complexity, incompatibility, risk, and unmanageability of multiplatform big data environments. And the situation is likely to grow more fragmented as big data hybrid deployments predominate.
The bottom line is that hybrid big data environments will degenerate into a mess of incompatible platforms unless the industry puts a renewed focus on standardization.
What do you think? Let me know in the comments.