Apache Hadoop is fundamental to the next generation of data warehousing. Companies are adopting Hadoop for strategic roles in their current warehousing architectures, such as extract/transform/load (ETL), data staging, and preprocessing of unstructured content. I also see Hadoop as a key technology in next-generation massively parallel data warehouses in the cloud, which will complement today’s warehousing technologies as well as low-latency streaming platforms.
At IBM, we expect Hadoop and data warehousing to tie the knot more completely over the next several years and converge into a new platform paradigm: the Hadoop data warehouse. Hadoop won’t render traditional warehousing architectures obsolete. Instead, it will supplement and extend the data warehouse to support a single version of the truth, data governance, and master data management for multi-structured data, that is, data that exists in at least two of the following formats: structured (such as relational or tabular), semi-structured (including XML-tagged free-text files), and unstructured (for example, ASCII and other free-text formats).
In many ways, the data warehouse and Hadoop are already married in spirit: they share a common underlying architectural approach, both in IBM’s architecture and in the industry at large. The chief features of this shared approach are massively parallel processing, in-database analytics, mixed workload management, and flexible storage layers.
Hadoop is here to stay, and it’s clearly becoming pivotal to users’ and vendors’ big data approaches going forward. The reasons for Hadoop’s momentum include:
Nevertheless, the evolution of a converged Hadoop data warehousing platform won’t happen overnight. It won’t result in Hadoop, in its current form, crowding out any legacy or new approaches to big data. And it won’t come at the expense of hot technologies such as in-memory, columnar, or graph databases. All of these approaches will coexist in the Hadoop data warehouse that will become mainstream in the near future.
Within the data warehousing ecosystem, Hadoop (and NoSQL technologies generally) is becoming most entrenched in the staging, preprocessing, and ETL tier. In this role, Hadoop handles the three Vs (volume, velocity, and variety) of social, sensor, event, clickstream, RFID, and other new data sources. Likewise, we’re seeing Hadoop become the “sandboxing” platform of choice for data scientists to explore huge, complex data sets and develop sophisticated statistical models for leading-edge big data applications.
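To make the staging and preprocessing role more concrete, here is a minimal sketch of the map, shuffle, and reduce pattern that Hadoop MapReduce applies at scale, written in plain Python so it runs on a single machine. It is an illustration only, not an IBM or Hadoop API: the tab-separated clickstream layout, the sample records, and the function names (map_event, shuffle, reduce_counts, run_job) are all assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical raw clickstream lines: "timestamp<TAB>user_id<TAB>url".
# The layout and the sample values are assumptions for illustration.
RAW_EVENTS = [
    "2013-05-01T10:00:00\tu1\t/home",
    "2013-05-01T10:00:05\tu2\t/products",
    "malformed line without tabs",
    "2013-05-01T10:01:00\tu1\t/products",
    "2013-05-01T10:02:00\tu3\t/home",
]


def map_event(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: parse one raw line and emit (url, 1); skip malformed input."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 3:
        return  # cleansing: drop records that do not match the layout
    _timestamp, _user_id, url = parts
    yield url, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse the values for one key into a page-view count."""
    return key, sum(values)


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the raw lines."""
    mapped = (pair for line in lines for pair in map_event(line))
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Aggregated page views, ready to load into a warehouse table.
    print(run_job(RAW_EVENTS))
```

In a real cluster, the map and reduce steps would run in parallel across many nodes and the framework would handle the shuffle. The sketch only shows the shape of the work: cleanse raw events, then aggregate them into a structured result that a warehouse can load.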
One of the exciting things about the emerging Hadoop data warehouse is that, as the requisite governance, security, and management tools become available, it will suit applications that require a single, consolidated 360-degree view of customer data, both structured (transactional) and unstructured (social). That data drives offer targeting, experience optimization, and other factors in your digital channel strategy. This capability will be the killer application of the next-generation Hadoop data warehouse, one that neither constituent technology alone is optimized to support.
Your Hadoop data warehouse will be a powerful converged platform. But it won’t necessarily deliver immediate business value unless your vendor can provide out-of-the-box solution accelerators suited to the specific big data applications you’re deploying, such as social media analytics and real-time infrastructure monitoring. When evaluating commercial big data and Hadoop solutions, you should also consider whether they bundle key solution accelerator elements, especially sample applications, user-defined and standard development toolkits, and industry data models in areas such as banking, insurance, telco, healthcare, and retail.
At IBM, we’re addressing these and other Hadoop data warehouse requirements across our diverse information management solution areas. Stay tuned for further details in the coming months.