Big Data and Warehousing

True Hadoop Standards are Essential for Sustaining Industry Momentum: Part 2

A suggested framework for Hadoop industry standardization

In Part 1 of this article, I highlighted the essential need for Hadoop standards. The Apache community should submit Hadoop to a formal standardization process under an industry forum—either an established group or one that is specifically focused on big data. Under such an effort, the Hadoop industry should clarify the reference framework within which new Hadoop specifications are developed. That framework should safeguard the community’s ability to evolve the core code bases, innovate vigorously, and differentiate competitively in areas that don’t jeopardize community-wide interoperability.

At minimum, the Hadoop industry reference framework would specify distinct service layers by functional area, with well-defined interfaces or abstractions to ensure interoperability across those layers; a minimal code sketch written against one such layer follows the list below. The core functional service layers should include:

  • Hadoop modeling and development: MapReduce, Pig, Mahout, and so on
  • Hadoop storage and data management: HDFS, HBase, Cassandra, and so on
  • Hadoop data warehousing, summarization, and query: Hive, Sqoop, and so on
  • Hadoop data collection, aggregation, and analysis: Chukwa, Flume, and so on
  • Hadoop metadata, table, and schema management: HCatalog, and so on
  • Hadoop cluster management, job scheduling, and workflow: ZooKeeper, Oozie, Ambari, and so on
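
To make the modeling and development layer concrete, here is a minimal sketch of the canonical word-count job written against the standard org.apache.hadoop.mapreduce API. The input and output paths (args[0] and args[1]) are placeholders; the point of the sketch is that the job touches only the public Mapper, Reducer, and Job interfaces, while the storage layer underneath is reached through the Path abstraction rather than addressed directly. That is exactly the kind of layered interface a reference framework would pin down.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Minimal word-count job written against the MapReduce API layer only;
// the storage underneath (HDFS or another FileSystem implementation) is
// reached through the Path/FileSystem abstraction, not addressed directly.
public class WordCount {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);   // emit (word, 1) for every token
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);   // emit (word, total count)
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // local aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // placeholder input path
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // placeholder output path
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

A job like this should run largely unchanged whether the underlying cluster is a bare Apache distribution or a commercial offering that layers its own enhancements on top—which is the interoperability payoff a standardized layering is meant to protect.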

The industry should develop the Hadoop reference framework to address the key big data use cases in which organizations are deploying this technology:

  • Hadoop for data staging, transformation, cleansing, and preprocessing: Many enterprises use Hadoop for strategic roles in their current warehousing architectures, especially extract/transform/load, data staging, and preprocessing of unstructured content. Often, Hadoop clusters perform these critical data preparation roles to support the creation of analytical datasets that data scientists use to build their MapReduce, R, machine learning, and other analytic models. Hadoop clusters offer the massively parallel power to crunch petabytes of complex content in short order. (A minimal map-only cleansing sketch follows this list.)
  • Hadoop for advanced analytics development sandboxing: Hadoop has proven itself a strategic basis for big data development “sandboxes.” This use case, which is common among many Hadoop early adopters, involves providing data scientist teams with consolidated, petabyte-scalable data repositories for interactive exploration, statistical correlation, and predictive modeling. The sandboxing use case puts a high priority on integrating the Hadoop platform with a rich library of statistical and mathematical algorithms. It also emphasizes tools for automated sandbox provisioning, fast data loading and integration, job scheduling and coordination, MapReduce modeling and scoring, model management, interactive exploration, and advanced visualization. As the repository for valuable unstructured data—such as geospatial, social, and sensor information—Hadoop can play a central role in any big data initiative. In this way, Hadoop can supplement, rather than replace, the analytic sandboxes an organization has implemented to support modeling with tools such as IBM® SPSS®, which often focus on more traditional structured data from customer relationship management and enterprise resource planning systems.
  • Hadoop for data warehousing: IBM expects Hadoop and data warehousing to tie the knot more completely over the next several years and converge into a new platform paradigm: the Hadoop data warehouse. Although Hadoop won’t render traditional warehousing architectures obsolete, it will supplement and extend the data warehouse to support a single version of the truth, data governance, and master data management for multi-structured data that exists in at least two of the following formats: structured (such as relational or tabular), semi-structured (including XML-tagged free-text files), and unstructured (for example, ASCII and other free-text formats). In many ways, the data warehouse and Hadoop already share a common underlying architectural approach—both in the IBM architecture and in the industry at large. The chief features of this shared approach are massively parallel processing, in-database analytics, mixed workload management, and flexible storage layers.
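
To illustrate the first use case above, here is a minimal, map-only cleansing sketch: it drops malformed delimited records and lightly normalizes the rest before they feed downstream analytical datasets. The pipe delimiter, the expected field count of seven, and the class name RecordCleansingMapper are illustrative assumptions, not part of any particular product.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only cleansing pass: reject malformed delimited records and normalize the
// rest before loading into downstream analytical datasets. The field count and
// the pipe delimiter are placeholder assumptions for illustration.
public class RecordCleansingMapper
    extends Mapper<LongWritable, Text, NullWritable, Text> {

  private static final int EXPECTED_FIELDS = 7;   // assumed record width
  private final Text cleaned = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] fields = line.toString().split("\\|", -1);
    if (fields.length != EXPECTED_FIELDS) {
      context.getCounter("cleansing", "malformed").increment(1);  // track rejects
      return;                                                     // discard the record
    }
    StringBuilder out = new StringBuilder();
    for (int i = 0; i < fields.length; i++) {
      if (i > 0) out.append('|');
      out.append(fields[i].trim().toLowerCase());                 // simple normalization
    }
    cleaned.set(out.toString());
    context.write(NullWritable.get(), cleaned);                   // map-only: no reduce phase
  }
}

In the driver, calling job.setNumReduceTasks(0) skips the shuffle and reduce phases entirely, which is typical for this kind of staging pass; the cleansed output then becomes the analytical dataset that downstream models consume.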

In addition, there should be standard industry performance benchmarks for Hadoop, addressing these use cases and their most characteristic workloads. The Hadoop market has matured to the point where users now have plenty of high-performance options, including IBM InfoSphere® BigInsights™. The core open-source Hadoop stack is common across most commercial solutions, including BigInsights. The core mapping and reducing functions are well defined and capable of considerable performance enhancement, leveraging proven approaches such as Adaptive MapReduce, which is at the heart of BigInsights. Customers are increasingly using performance as a key criterion to compare different vendors’ Hadoop offerings, and often use various sort benchmarks to guide their evaluations. Many are demanding that the industry adopt a clear, consensus approach to performance claims for core operations, including NameNode operations, HDFS reads/writes, MapReduce jobs (maps, reduces, sorts, shuffles, and merges), and compression/decompression.
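Hadoop already ships with workload generators such as TeraSort and TestDFSIO, but vendors tend to run them under widely varying configurations, which is why comparable, standardized reporting matters. As a flavor of what a repeatable measurement looks like at the HDFS read/write level, here is a rough probe written against the FileSystem API; the target path, buffer size, and total volume are placeholder assumptions, and a formal benchmark specification would pin down far more (replication factor, block size, cluster state, and so on).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Rough sketch of a repeatable HDFS write measurement: push a fixed volume of
// data through the FileSystem API and report throughput. Path and sizes are
// placeholders; a standardized benchmark would specify them precisely.
public class HdfsWriteProbe {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();          // picks up core-site.xml/hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    Path target = new Path("/benchmarks/probe/write-test.dat");  // placeholder path

    byte[] buffer = new byte[1 << 20];                 // 1 MB buffer
    long totalBytes = 256L * buffer.length;            // 256 MB total (placeholder volume)

    long start = System.nanoTime();
    try (FSDataOutputStream out = fs.create(target, true)) {
      for (long written = 0; written < totalBytes; written += buffer.length) {
        out.write(buffer);
      }
      out.hsync();                                     // force data out to the datanodes
    }
    long elapsedNanos = System.nanoTime() - start;

    double seconds = elapsedNanos / 1e9;
    double mbPerSec = (totalBytes / (1024.0 * 1024.0)) / seconds;
    System.out.printf("Wrote %d bytes in %.2f s (%.1f MB/s)%n", totalBytes, seconds, mbPerSec);
    fs.delete(target, false);                          // clean up the probe file
  }
}

A consensus benchmark would turn ad hoc probes like this into published, auditable procedures, so that performance claims across vendors’ Hadoop offerings can actually be compared.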

 

Hadoop standards need to consider the broader big data industry landscape

Hadoop standards must play well in the sprawling tableau of both established and emerging big data technologies. The larger picture is that the enterprise data warehouse is evolving into a virtualized cloud ecosystem in which relational, columnar, and other database architectures will coexist in a pluggable big data storage layer alongside HDFS, HBase, Cassandra, graph databases, and other NoSQL platforms.

Hadoop standards will form part of a broader, but still largely undefined, service-oriented virtualization architecture for inline analytics. Under this paradigm, developers will create inline analytic models that deploy to a dizzying range of clouds, event streams, file systems, databases, complex event processing platforms, and next-best-action platforms.

The Hadoop reference framework should be developed according to principles that preserve and extend interoperability with the growing range of other big data platforms in use, such as data warehousing, stream computing, in-memory, columnar, NoSQL, and graph databases.

In my opinion, Hadoop’s pivotal specification in this larger evolution is MapReduce. Within the big data cosmos, MapReduce will be a major unifying development framework supported by many database and integration platforms. Currently, IBM supports MapReduce models both in its Hadoop offering, InfoSphere BigInsights, and in its stream-computing platform, InfoSphere Streams.

In terms of particular new Hadoop specifications that would benefit the entire market and facilitate cross-platform interoperability, a multi-language query abstraction layer would be a much-needed addition to address the heterogeneous big data universe we’re living in. Such a specification would virtualize the diverse, confusing range of query languages—HiveQL, CassandraQL, JAQL, Sqoop (SQL to Hadoop), SPARQL, and so on—in use within the Hadoop and NoSQL communities. Having a unified query abstraction layer would enable more flexible topologies of Hadoop and non-Hadoop platforms in a common big data architecture, reflecting the work of many early adopters that had to build custom integration code to support their environments.
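
To be clear about what such a specification might cover, here is a purely hypothetical sketch of the thin kind of interface a standards body could define; every name in it (BigDataQuerySession, Dialect, execute) is invented for illustration and does not correspond to any existing API. The substance would lie in the conformance rules and the pluggable translators behind it, not in the Java signatures themselves.

import java.util.Iterator;
import java.util.Map;

// Hypothetical sketch only: the names and methods here are illustrative, not
// part of any existing specification. The idea is a thin abstraction that a
// standards body could define, with pluggable translators for HiveQL,
// CassandraQL, JAQL, SPARQL, and other dialects behind it.
public interface BigDataQuerySession extends AutoCloseable {

  /** Dialects a compliant engine might register; the list is illustrative. */
  enum Dialect { HIVEQL, CASSANDRA_QL, JAQL, SPARQL, VENDOR_SPECIFIC }

  /** Report which dialects the underlying engine can accept. */
  Iterable<Dialect> supportedDialects();

  /**
   * Submit a statement in the stated dialect and stream back rows as
   * column-name-to-value maps, so callers are insulated from engine-specific
   * result types.
   */
  Iterator<Map<String, Object>> execute(Dialect dialect, String statement);
}

A common result shape (here, column-name-to-value maps) is one plausible way to let the same client code run against Hive, Cassandra, or a triple store; settling on the right shape is exactly the kind of question a formal standards process would have to hash out.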

Who will take the first necessary step to move the Hadoop community toward more formal standardization? That’s a big open issue.
 

 

James Kobielus

James Kobielus is a big data evangelist at IBM and the editor in chief of IBM Data magazine. He is an industry veteran, a popular speaker and social media participant, and a thought leader in big data, Apache Hadoop, enterprise data warehousing, advanced analytics, business intelligence, data management, and next-best-action technologies.