As the most widely adopted new big data technology, Hadoop is far too important to modern business strategies to continue without more structured industry governance. The technology’s maturation depends on a coordinated effort to clarify how it will hang together both internally among its growing range of subprojects, and externally with other big data specifications and communities.
The Hadoop market won’t fully mature and may face increasing obstacles to growth and adoption if the industry does not begin soon to converge on a truly standardized core stack. A Hadoop standards framework would provide a grand vision for the evolution of this technology in a broader big-data industry context. Standards are essential so that vendors and users of Hadoop-based solutions can have assured cross-platform interoperability.
Right now, silos reign in the Hadoop world—a situation that is aggravated by the lack of open interoperability standards. Most Hadoop users have built their deployments either on the core Apache Hadoop open-source code or on specific vendors’ distributions of that code base. IBM has adopted the full-core Apache Hadoop open-source stack into the IBM® InfoSphere® BigInsights™ product. However, some vendors have forked the core Apache Hadoop code base within the context of their solution portfolios, developing proprietary extensions, tools, and other components. Some—but not all—Hadoop vendors have contributed some of their code back to the open-source Apache Hadoop community.
Some of this industry divergence from strictly open-source–based Hadoop has been necessary for solution providers to field enterprise-grade offerings in the face of feature gaps in the core Apache Hadoop subprojects. Nevertheless, if the forking and proprietary extensions continue to take place outside a clear set of standard industry reference implementations and interfaces, silo wars may ensue as vendors with market clout try to leverage their first-mover advantage and rally industry support around their proprietary Hadoop implementation.
Let’s look at this from a historical perspective. In the early 2000s, the service-oriented architecture world didn’t begin to mature until industry groups such as OASIS and WS-I stabilized a core group of specs such as WSDL, SOAP, and the like. Today’s big data world badly needs a similar push to standardize the core of Hadoop, its key new open-source–based approach. For now, Hadoop remains squarely in the “de facto standard” camp.
The latest version of the open-source Apache Hadoop code base, loosely known as Hadoop 2.0, has many valuable enhancements including high availability and federation in the Hadoop Distributed File System (HDFS), and support for alternate programming frameworks in MapReduce. However, I’m a bit disappointed that these enhancements were rolled out without any unifying rationale.
No one has stepped forward to present a coherent vision for Hadoop’s ongoing development. When will development of Hadoop’s various components—MapReduce, HDFS, Pig, Hive, and so on—be substantially complete? What is the reference architecture within which these other Hadoop services are being developed under Apache’s auspices?
Also, nobody has defined where Hadoop fits within the growing menagerie of big data technologies. Where does Hadoop end and various NoSQL technologies begin? What requirements and features, best addressed elsewhere, should be left off the Apache Hadoop community’s development agenda?
And no one has called for a move toward more formal standardization of the various Hadoop technologies within a core reference architecture. The Hadoop industry badly needs standardization to support certification of cross-platform interoperability. Standardization, together with the continued convergence of all solution providers on the core Apache Hadoop code base, is the only way to ensure that siloed proprietary implementations don’t stall the industry’s progress toward widespread adoption.
In Part 2 of this article, I will explore the functional areas on which a possible Hadoop standards reference framework should focus.