Apache Hadoop is fundamental to the next generation of data warehousing. Companies are adopting Hadoop for strategic roles in their current warehousing architectures, such as extract/transform/load (ETL), data staging, and preprocessing of unstructured content. I also see Hadoop as a key technology in next-generation massively parallel data warehouses in the cloud, which will complement today’s warehousing technologies as well as low-latency streaming platforms.
At IBM, we expect Hadoop and data warehousing to tie the knot more completely over the next several years and converge into a new platform paradigm: the Hadoop data warehouse. Hadoop won’t render traditional warehousing architectures obsolete; instead, it will supplement and extend the data warehouse to support a single version of the truth, data governance, and master data management for multi-structured data. By multi-structured, I mean data that exists in at least two of the following formats: structured (such as relational or tabular), semi-structured (including XML-tagged free-text files), and unstructured (for example, ASCII and other free-text formats).
In many ways, the data warehouse and Hadoop are already married in spirit, for in a very real sense they share a common underlying architectural approach—both in IBM’s architecture and in the industry at large. The chief features of this shared approach are massively parallel processing, in-database analytics, mixed workload management, and flexible storage layers.
Hadoop is here to stay, and it’s clearly becoming pivotal to users’ and vendors’ big data approaches going forward. The reasons for Hadoop’s momentum include:
Nevertheless, the evolution of a converged Hadoop data warehousing platform won’t happen overnight. It won’t result in Hadoop, in its current form, crowding out any legacy or new approaches to big data. And it won’t come at the expense of hot technologies such as in-memory, columnar, or graph databases. All of these approaches will coexist in the Hadoop data warehouse that will become mainstream in the near future.
The key role in the data warehousing ecosystem where Hadoop (and NoSQL technologies generally) is becoming entrenched is the staging, preprocessing, and ETL tier. In this role, Hadoop handles the three Vs (volume, velocity, and variety) of social, sensor, event, clickstream, RFID, and other new data sources. Likewise, we’re seeing Hadoop become the “sandboxing” platform of choice for data scientists to explore huge, complex data sets and develop sophisticated statistical models for leading-edge big data applications.
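To make the staging and preprocessing role a little more concrete, here is a minimal sketch in plain Python of the map-and-reduce pattern that such a tier typically applies to raw event data before loading it into the warehouse. It is an illustration only, not Hadoop code: the sample clickstream records, the field names (`user`, `page`, `ms`), and the helper functions are all hypothetical, and a real deployment would run equivalent logic across a cluster.

```python
import json
from collections import defaultdict

# Hypothetical raw clickstream lines, as they might land in a staging area.
RAW_EVENTS = [
    '{"user": "u1", "page": "/home", "ms": 120}',
    '{"user": "u2", "page": "/cart", "ms": 340}',
    'not-json garbage line',
    '{"user": "u1", "page": "/cart", "ms": 200}',
]


def map_event(line):
    """Map step: parse one raw line and emit (user, ms); skip malformed input."""
    try:
        event = json.loads(line)
        yield event["user"], int(event["ms"])
    except (ValueError, KeyError, TypeError):
        # Malformed or incomplete records are dropped in this sketch.
        return


def reduce_events(pairs):
    """Reduce step: aggregate per user into tabular, warehouse-ready rows."""
    totals = defaultdict(lambda: {"events": 0, "total_ms": 0})
    for user, ms in pairs:
        totals[user]["events"] += 1
        totals[user]["total_ms"] += ms
    return [
        {"user": user, "events": agg["events"], "total_ms": agg["total_ms"]}
        for user, agg in sorted(totals.items())
    ]


def preprocess(lines):
    """Run the map step over every line, then reduce the emitted pairs."""
    pairs = (pair for line in lines for pair in map_event(line))
    return reduce_events(pairs)


ROWS = preprocess(RAW_EVENTS)
```

The point of the sketch is the division of labor: messy, semi-structured input is cleaned and summarized in the staging tier, and only the resulting structured rows move on to the warehouse.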
One of the exciting things about the emerging Hadoop data warehouse is that, as the requisite governance, security, and management tools mature, it will be suited to applications that require a consolidated 360-degree view of the truth about your customers. That view draws on the structured (transactional) and unstructured (social) customer data that drives offer targeting, experience optimization, and other factors in your digital channel strategy. This capability will be the killer application of the next-generation Hadoop data warehouse, one that neither constituent technology alone is optimized to support.
Your Hadoop data warehouse will be a powerful converged platform. But it won’t necessarily deliver immediate business value unless your vendor can provide out-of-the-box solution accelerators suited to the specific big data applications you’re deploying, such as social media analytics and real-time infrastructure monitoring. In evaluating commercial big data and Hadoop solutions, you should also consider whether they bundle key solution accelerator elements, especially sample applications, user-defined and standard development toolkits, and industry data models in areas such as banking, insurance, telco, healthcare, and retail.
At IBM, we’re addressing these and other Hadoop data warehouse requirements across our diverse information management solution areas. Stay tuned for further details in the coming months.