As the most widely adopted new big data technology, Hadoop is far too important to modern business strategies to continue without more structured industry governance. The technology’s maturation depends on a coordinated effort to clarify how it will hang together both internally among its growing range of subprojects, and externally with other big data specifications and communities.
The Hadoop market won’t fully mature and may face increasing obstacles to growth and adoption if the industry does not begin soon to converge on a truly standardized core stack. A Hadoop standards framework would provide a grand vision for the evolution of this technology in a broader big-data industry context. Standards are essential so that vendors and users of Hadoop-based solutions can have assured cross-platform interoperability.
Right now, silos reign in the Hadoop world—a situation that is aggravated by the lack of open interoperability standards. Most Hadoop users have built their deployments either on the core Apache Hadoop open-source code or on specific vendors’ distributions of that code base. IBM has adopted the full-core Apache Hadoop open-source stack into the IBM® InfoSphere® BigInsights™ product. However, some vendors have forked the core Apache Hadoop code base within the context of their solution portfolios, developing proprietary extensions, tools, and other components. Some—but not all—Hadoop vendors have contributed some of their code back to the open-source Apache Hadoop community.
Some of this industry divergence from strictly open-source–based Hadoop has been necessary for solution providers to field enterprise-grade offerings in the face of feature gaps in the core Apache Hadoop subprojects. Nevertheless, if the forking and proprietary extensions continue to take place outside a clear set of standard industry reference implementations and interfaces, silo wars may ensue as vendors with market clout try to leverage their first-mover advantage and rally industry support around their proprietary Hadoop implementation.
Let’s look at this from a historical perspective. In the early 2000s, the service-oriented architecture world didn’t begin to mature until industry groups such as OASIS and WS-I stabilized a core group of specs such as WSDL, SOAP, and the like. Today’s big data world badly needs a similar push to standardize the core of Hadoop, its key new open-source–based approach. Hadoop remains squarely in the “de facto standard” camp.
The latest version of the open-source Apache Hadoop code base, loosely known as Hadoop 2.0, has many valuable enhancements including high availability and federation in the Hadoop Distributed File System (HDFS), and support for alternate programming frameworks in MapReduce. However, I’m a bit disappointed that these enhancements were rolled out without any unifying rationale.
No one has stepped forward to present a coherent vision for Hadoop’s ongoing development. When will development of Hadoop’s various components—MapReduce, HDFS, Pig, Hive, and so on—be substantially complete? What is the reference architecture within which these other Hadoop services are being developed under Apache’s auspices?
Also, nobody has defined where Hadoop fits within the growing menagerie of big data technologies. Where does Hadoop end and various NoSQL technologies begin? What requirements and features, best addressed elsewhere, should be left off the Apache Hadoop community’s development agenda?
And no one has called for a move toward more formal standardization of the various Hadoop technologies within a core reference architecture. The Hadoop industry badly needs standardization to support certification of cross-platform interoperability. This standardization plus the continued convergence of all solution providers on the core Apache Hadoop code base are the only ways to ensure that siloed proprietary implementations don’t stall the industry’s progress toward widespread adoption.
In Part 2 of this article, I will explore the functional areas on which a possible Hadoop standards reference framework should focus.
IBM is named a leader in the first Forrester Wave for data governance tools
Video: See how to use IBM Navigator on Cloud to enhance team productivity
An Inc. magazine article shows how a zoo learned a lot from big data
Access geospatial analytics in the Bluemix Analytics Warehouse service
IBM Institute for Business Value study: Apply advanced analytics to enhance patient outcomes
IBM ebook: Gain insights for the chief data officer role
Big data in a minute: The composable business