Consumerizing Big Data
Many in the industry continue to remark that Apache Hadoop is still very much a nascent market. While I believe that sentiment to be true, there does appear to be a lot more movement into an adoption phase, particularly over the past six months. A large part of what is driving this adoption is a desire for cost containment in this era of big data.
Storing data in Hadoop is significantly more cost-effective than storing it in a well-groomed data warehouse. As discussed in my last column, “Relishing the Big Data Burger,” Hadoop is consistently leveraged as both a landing zone and as an active archive to move cold data from the warehouse. These use cases make total sense; they are cost-effective, they help increase performance, and they allow organizations to tap into new data sources.
The second driver after cost containment is the competitive-edge opportunity that big data can offer an organization. In today’s business environment, organizations using analytics can gain real competitive advantage. Recent studies showed a 57 percent increase from 2010 to 2011 in respondents who say analytics create a competitive advantage.1 New data sources can provide a true 360-degree view of the customer that could not be achieved in the past. We can now consume dark data (large volumes of machine data) that provides fresh insight into operational issues and fraud, helps manage a brand in real time, and helps us understand customer behavior, likes, and dislikes at a much deeper level than before. In short, Hadoop offers the following compelling advances:
- The opportunity to tap into a wide range of data sources
- Existing analytics enhanced by unstructured data sources
- Analytic ecosystem cost reduction by pre-processing data and storing cold data
- Advanced analytic capability to remain competitive
Follow the curve
As the curve starts to turn toward increased adoption, some interesting things happen that I call the consumerization of Hadoop. Before you think, “there goes another IBMer making up another term,” think about the technology curve of just about anything new that starts to be leveraged. Early adopters are the trailblazers, and in the case of Hadoop both Facebook and Yahoo blazed a challenging trail so that Hadoop could become enterprise-grade. But as is the case with most new technologies, when the adoption rate climbs and the product becomes increasingly consumable, our understanding of what is really needed to make the technology work in our environments grows. There are three ways in which Hadoop is becoming much more consumable: improving data accessibility, overcoming a skills gap, and deploying appliance-based solutions.
Enhancing accessibility of Hadoop-stored data
A key challenge to moving up the adoption curve is a prevailing shortage of skills. Many organizations today are growing their Hadoop skills organically because acquiring that expertise from outside has been so difficult. We know SQL, and we have people who know SQL, so increasing data accessibility in Hadoop requires a more SQL-like interface. One of the Hadoop trailblazers, Facebook, deserves the credit for starting this revolution with the Hive data warehouse system, which is now an Apache project. Today, many leading-edge vendors, including IBM, are providing this type of access to Hadoop because it just makes sense.
IBM offers Big SQL technology, which is part of the IBM® InfoSphere® BigInsights™ platform version 2.1 and is designed to provide a SQL-like interface for querying data in Hadoop. This interface enables creating and querying tables for data that is stored in Hive, Apache HBase, and BigInsights.
Big SQL supports Java Database Connectivity (JDBC) and Open Database Connectivity (ODBC) client access from Red Hat Enterprise Linux and Microsoft Windows operating systems. Data can be read directly from other relational sources such as IBM DB2® data management, an IBM PureData™ System for Analytics database, and even Teradata enterprise analytic technologies. Big SQL also supports the Hadoop Distributed File System (HDFS) or IBM General Parallel File System (GPFS™) in BigInsights version 2.1.
In addition, Big SQL makes big data highly consumable because it allows people with SQL skills to access data stored in Hadoop. Leveraging new and unstructured data sources can significantly enhance existing analytics and allow end users to ask big questions of the data.
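To make the point about SQL skills concrete, here is a minimal sketch of the kind of aggregate query an analyst with ordinary SQL knowledge could write. It uses Python's built-in sqlite3 module purely as a local stand-in: it is not Big SQL or Hive, and the table name clickstream, its columns, and the helper top_pages are hypothetical names chosen for illustration.

```python
import sqlite3


def top_pages(rows, limit=3):
    """Return the most-viewed pages using a plain SQL aggregate query.

    Assumption: sqlite3 is only a local stand-in here; it is NOT Big SQL.
    The `clickstream` table and its columns are hypothetical.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE clickstream (user_id TEXT, page TEXT)")
        conn.executemany("INSERT INTO clickstream VALUES (?, ?)", rows)
        query = (
            "SELECT page, COUNT(*) AS views "
            "FROM clickstream "
            "GROUP BY page "
            "ORDER BY views DESC, page ASC "
            "LIMIT ?"
        )
        return conn.execute(query, (limit,)).fetchall()
    finally:
        conn.close()


# Hypothetical sample of machine-generated click events.
sample = [
    ("u1", "/home"), ("u2", "/home"), ("u1", "/cart"),
    ("u3", "/home"), ("u2", "/cart"), ("u3", "/help"),
]

if __name__ == "__main__":
    for page, views in top_pages(sample, limit=2):
        print(page, views)
```

The value of a SQL-like interface is that a query of this shape stays familiar to the analyst regardless of where the underlying data is stored.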
Addressing the skills gap with education
Many organizations can agree that a primary challenge they face in leveraging all this data is the skills needed to handle it. This skill set requires a combination of data scientists, database administrators (DBAs), and professionals adept with Hadoop and the ecosystem. Many organizations believe the adoption curve of Hadoop has slowed because of the shortage of people with adequate skills.
I first discovered this challenge when reading a May 2011 report on research by McKinsey Global Institute that looked at big data in five international domains. One of several key insights offered by the research addressed this skills shortage: “By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.”2
If this shortage is going to hit in 2018, that means we have to start now and get to the kids in college, right? But are big data and Hadoop being adopted into university curricula to achieve this goal? The answer to this question is an emphatic “yes.”
IBM is now working with over 1,000 global universities to integrate big data and Hadoop into collegiate-level curricula. An August 14, 2013, press announcement highlights the global focus designed to prepare students for the 4.4 million jobs that big data is likely to create in the next few years.3
In the short term, IBM has announced a new services offering aimed not just at providing skills for big data projects but also at skills transfer. The expected IBM Big Data Stampede services offering focuses on transferring the skills that can allow organizations to be more self-sustaining.
Deploying Hadoop-based appliances
Just as appliances helped simplify the deployment of data warehouses so that organizations could focus on business value, they offer the same value proposition for Hadoop. The opportunity data offers is so valuable that prolonging the time required for a release of a product or service is not cost-effective. The shortage of Hadoop skills and the challenge of integration and management diminish the ability to leverage new, untapped data sources. Expanding the consumerization of Hadoop through appliance-based solutions makes a lot of sense to help accelerate the ability of organizations to leverage data effectively.
IBM recently announced the general availability of the IBM PureData® System for Hadoop appliance, the latest member of the IBM PureSystems™ family. PureData System for Hadoop was designed specifically to make enterprise Hadoop more consumable and accelerate the ability to gain insight from big data. This single integrated appliance enables enterprises to be up and running in hours with built-in visualization and analytic accelerators that help achieve insight rapidly.
In addition to accelerating insight, PureData System for Hadoop offers built-in archiving software that helps move cold data from the PureData System for Analytics database to the Hadoop appliance. Archiving is a key use case for Hadoop that helps reduce the cost of storage while optimizing the performance of the data warehouse.
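The archiving use case comes down to deciding which warehouse records count as cold. The sketch below illustrates one simple way to make that decision, by age since last access. It is not how the PureData archiving software works; the function split_hot_cold, the last_accessed field, and the 365-day threshold are all assumptions made for illustration.

```python
from datetime import date, timedelta


def split_hot_cold(records, today, max_age_days=365):
    """Partition records into (hot, cold) by time since last access.

    Assumption: each record is a dict with a hypothetical `last_accessed`
    date field. Records last accessed before the cutoff are treated as
    cold, meaning they are candidates to move to a lower-cost archive.
    """
    cutoff = today - timedelta(days=max_age_days)
    hot, cold = [], []
    for record in records:
        if record["last_accessed"] < cutoff:
            cold.append(record)
        else:
            hot.append(record)
    return hot, cold


# Hypothetical warehouse rows.
rows = [
    {"id": 1, "last_accessed": date(2011, 1, 15)},
    {"id": 2, "last_accessed": date(2013, 8, 20)},
    {"id": 3, "last_accessed": date(2012, 3, 2)},
]

if __name__ == "__main__":
    hot, cold = split_hot_cold(rows, today=date(2013, 9, 1))
    print("keep in warehouse:", [r["id"] for r in hot])
    print("archive candidates:", [r["id"] for r in cold])
```

In practice the threshold would follow from how often the business actually queries older data, so it is left as a parameter here.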
Expand data accessibility
As Hadoop becomes increasingly mainstream in the enterprise, we continue to find ways to accelerate adoption by making the technology highly consumable, which follows a very normal curve for any technology advancement. Remember the first laptops that weighed 40 pounds, or the first mobile phones that resembled a small suitcase? End-user adoption continues to drive fresh ways to increase the consumerization of Hadoop, and education and services should help ease the skills gap. SQL-like interfaces and appliances can similarly enhance the accessibility of Hadoop and ease its implementation.
IBM is helping further these initiatives with leading-edge approaches in all these areas. As of late 2012, the adoption curve spanning several industries—from trailblazers to early and late majority adopters—crossed the 50 percent milestone.4 Just imagine what can happen in 2014.
1 “Analytics: The widening divide,” IBM Institute for Business Value, IBM Global Business Services in collaboration with MIT Sloan Management Review, October 2011.
2 “Big data: The next frontier for innovation, competition, and productivity”; Insights and Publications report; McKinsey Global Institute (MGI), the business and economics research arm of McKinsey & Company, and McKinsey’s Business Technology Office; May 2011.
3 “IBM narrows big data skills gap by partnering with more than 1,000 global universities,” IBM news release, IBM, August 14, 2013.
4 “Big data comes of age,” by Dr. Barry Devlin, Shawn Rogers, and John Myers, Enterprise Management Associates (EMA) and 9sight Consulting Research report sponsored by IBM, November 2012.