Smarter consolidation into Hadoop platforms
If you think Apache Hadoop is 100 percent ready to serve as a consolidated single-version-of-the-truth repository, think again. Yes, Hadoop is fast becoming a core component of most organizations’ big data strategies. But it’s not mature enough to completely replace your enterprise data warehouse (EDW). For all its strengths as an unstructured data integration layer, most Hadoop environments lack the robust security, availability, and governance that are standard in a mature EDW. These and other typical EDW-grade features are coming to Hadoop through both open-source and commercial distributions, but maturation is still one to three years away.
Here and now, it’s smarter to consider Hadoop a tactical consolidation platform for specific analytic purposes and data sources. Most notably, Hadoop has proven itself a strategic basis for big data development “sandboxes.” This use case, which is common among many Hadoop early adopters, involves providing data scientist teams with consolidated, petabyte-scalable data repositories for interactive exploration, statistical correlation, and predictive modeling.
As the holding pen for valuable unstructured data, such as geospatial, social, and sensor information, Hadoop can play a central role in any big data initiative. In this way, Hadoop can supplement, rather than replace, the analytic sandboxes an organization has implemented to support modeling with tools such as IBM SPSS, which often focus on more traditional structured data from customer relationship management and enterprise resource planning systems. Consequently, Hadoop might not be, nor does it need to be, the sole consolidated sandbox for all advanced analytics.
In this sandboxing use case, the mature EDW features mentioned above are a lower priority than they would be if you were using Hadoop as the consolidation platform for your EDW or operational data store. Instead, the sandboxing use case places a high priority on integration of the Hadoop platform with a rich library of statistical and mathematical algorithms. It also emphasizes tools for automated sandbox provisioning, fast data loading and integration, job scheduling and coordination, MapReduce modeling and scoring, model management, interactive exploration, and advanced visualization.
As you start to consolidate more operational analytics jobs on Hadoop clusters, you may find that instead of dumping everything into a one-size-fits-all cluster, it’s smarter to configure different clusters for different purposes. For example, the Hadoop Distributed File System (HDFS) is probably sufficient for batch MapReduce jobs, whereas real-time jobs might run best on clusters and nodes optimized for HBase or other low-latency database technologies that integrate with MapReduce execution engines.
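To make the fit-for-purpose idea concrete, here is a minimal Python sketch that maps a job type to a cluster profile. Everything in it is hypothetical and for illustration only: the profile names, the `ClusterProfile` type, and the `pick_cluster` function are assumptions, not part of Hadoop or any distribution. A real deployment would drive this from its own scheduler or cluster manager.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterProfile:
    """Hypothetical description of a purpose-built cluster."""
    name: str
    storage: str   # primary storage layer the cluster is tuned for
    workload: str  # kind of job the cluster is meant to serve


# Assumed profiles, mirroring the batch vs. real-time split described above.
PROFILES = {
    "batch": ClusterProfile("batch-cluster", "HDFS", "batch MapReduce"),
    "real-time": ClusterProfile("realtime-cluster", "HBase", "low-latency"),
}


def pick_cluster(job_kind: str) -> ClusterProfile:
    """Return the profile configured for a job kind, or fail loudly."""
    try:
        return PROFILES[job_kind]
    except KeyError:
        raise ValueError(
            f"no fit-for-purpose cluster defined for job kind {job_kind!r}"
        ) from None


if __name__ == "__main__":
    for kind in ("batch", "real-time"):
        profile = pick_cluster(kind)
        print(f"{kind}: {profile.name} ({profile.storage})")
```

Failing on an unknown job kind, rather than falling back to a default cluster, reflects the point of the paragraph: a job with no matching profile should prompt a deliberate placement decision instead of landing in a catch-all cluster.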
Some operational Hadoop deployments may figure into larger application-consolidation initiatives and may involve integrating Hadoop/MapReduce runtimes for analytical offload from online transaction processing, semantic web, and decision automation environments. In such cases, consider integrating your production Hadoop cluster with non-Hadoop technologies such as the Resource Description Framework (RDF) triple store in IBM DB2 v10, or with relational, NoSQL, and other databases.
As your organization’s Hadoop/MapReduce use cases and deployment topologies expand in scope, you may find yourself optimizing “fit-for-purpose” clusters or nodes for more fine-grained jobs. While consolidating more operational applications onto Hadoop, one option is to begin dedicating specific clusters or nodes to specific data sources and downstream applications. You can also assign dedicated nodes to mission-critical big data support functions such as archiving with query for e-discovery and log correlation for IT root cause analysis.
As a growing number of Hadoop-optimized data governance, security, cluster management, and other infrastructure tools reach the market, test and evaluate them thoroughly in stand-alone clusters before deploying them into “single version of the truth” scenarios for operational business intelligence. Among other features, you will need to evaluate, at the very least, each Hadoop platform’s level of integration with your company’s EDW for bidirectional data exchange. If you’re already doing in-database analytics, see whether each platform can consume the outputs of the other’s model runs.
Smarter consolidation depends on knowing the strengths and limitations of all data analytics platforms, including Hadoop. Consolidating all of an enterprise’s data and analytics into Hadoop isn’t optimal at this time and may never be, even as Hadoop evolves and permeates the EDW and other established approaches. What’s important is that you deploy each approach into the use cases that fit your specific big data environment.