Big data is at the heart of many cloud services deployments. As private and public cloud deployments become more prevalent, it will be critical for end-user organizations to have a clear understanding of big data application requirements, tool capabilities, and best practices for implementation.
Every enterprise roadmap for big data in the cloud should include these four key steps.
You must first identify big data applications where cloud approaches have an advantage that other approaches (such as software on commodity hardware or pre-integrated appliances) lack. Scenarios where cloud-based approaches might be suitable for your big data analytics requirements include:
Big data and cloud, each in their separate spheres, are sprawling paradigms. Getting your arms around their intersection—big data in the cloud—involves both understanding the core architectural principles of each approach and identifying the synergies among them within your platform architectural strategies.
Many big data analytics platform providers, including IBM (check out our PureSystems products), have long been focused on bringing cloud-ready architectures into the heart of their offerings. Likewise, cloud platform providers have integrated ever-larger data sets and more advanced analytics into their various offerings (our business analytics offerings, for example). So the synergies and overlaps among the distinct approaches are already baked into their DNA, as it were, and are supported, to varying degrees, in the respective platforms.
But that’s not necessarily the same as doing big data in the cloud as a coherent architectural approach. A more unified framework would need to functionally align the service layers of a big data analytics reference framework (data, metadata, models, rules, and so on) with the various layers (application as a service, platform as a service, and infrastructure as a service) of a full-blown cloud computing framework.
Big data in the cloud has so many potential functional service layers sprawling across so many nodes, clusters, and tiers that it’s easy to feel overwhelmed. Where do you start, and how do you architect it all according to a coherent reference model?
Big data increasingly lives inside comprehensive cloud architectures. Big data clouds increasingly span federated private and public deployments, encompass at-rest and in-motion data, incorporate a growing footprint of in-memory and flash storage, and are available on demand to all applications.
Smarter big data consolidation requires that you preserve—and even expand, where necessary—the distributed, multi-tier, heterogeneous, and agile nature of your big data environment by implementing a virtualization capability in middleware, in the access layer, and in the management infrastructure. Virtualization provides a unified interface to disparate resources that allows you to change, scale, and evolve the back end without disrupting interoperability with tools and applications.
From an architectural standpoint, key enablers of big data virtualization include abstraction layers for query optimization and semantic interoperability. These layers simplify access to the disparate schemas of the RDBMS, Hadoop, NoSQL, columnar, and other data management platforms that together constitute a logically unified data and analytics resource. The IBM BigSQL initiative is an example of a query virtualization layer for big data, and IBM offers various semantic interoperability tools in InfoSphere software and other middleware portfolios.
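To make the idea of a virtualization layer concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not how BigSQL or any InfoSphere product works: the `VirtualCatalog` class, its `register` and `query` methods, and the sample back ends are all hypothetical names invented for this example. An in-memory SQLite table stands in for an RDBMS, and a list of dictionaries stands in for a document store. The field maps play the role of a (very simple) semantic interoperability layer by translating each back end's native field names into one common schema.

```python
import sqlite3
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]


class VirtualCatalog:
    """Hypothetical facade: one logical interface over several back ends."""

    def __init__(self) -> None:
        self._sources: Dict[str, List[Callable[[], Iterator[Row]]]] = {}

    def register(self, logical_name: str,
                 fetch: Callable[[], Iterable[Row]],
                 field_map: Dict[str, str]) -> None:
        """Attach a back end; field_map maps native field names to common ones."""
        def adapted() -> Iterator[Row]:
            for raw in fetch():
                yield {common: raw[native] for native, common in field_map.items()}
        self._sources.setdefault(logical_name, []).append(adapted)

    def query(self, logical_name: str,
              where: Callable[[Row], bool] = lambda row: True) -> List[Row]:
        """Return matching rows from every back end behind the logical name."""
        rows: List[Row] = []
        for source in self._sources[logical_name]:
            rows.extend(r for r in source() if where(r))
        return rows


def make_relational_source() -> Callable[[], List[Row]]:
    # Stand-in for an RDBMS: an in-memory SQLite table with its own column names.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE cust (cust_id INTEGER, cust_name TEXT)")
    conn.executemany("INSERT INTO cust VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])

    def fetch() -> List[Row]:
        cur = conn.execute("SELECT cust_id, cust_name FROM cust ORDER BY cust_id")
        cols = [d[0] for d in cur.description]
        return [dict(zip(cols, r)) for r in cur.fetchall()]
    return fetch


def make_document_source() -> Callable[[], List[Row]]:
    # Stand-in for a document store: records with different field names.
    docs = [{"_id": 3, "fullName": "Edsger"}]
    return lambda: list(docs)


catalog = VirtualCatalog()
catalog.register("customers", make_relational_source(),
                 {"cust_id": "id", "cust_name": "name"})
catalog.register("customers", make_document_source(),
                 {"_id": "id", "fullName": "name"})

# The caller sees one schema and one call, whatever sits behind it.
result = catalog.query("customers", where=lambda r: r["id"] >= 2)
print(result)  # [{'id': 2, 'name': 'Grace'}, {'id': 3, 'name': 'Edsger'}]
```

Because callers depend only on the logical name and the common schema, a back end could be swapped or added by changing a `register` call rather than every consuming application. A production layer would also push predicates down to each back end instead of filtering in the facade, which this sketch does not attempt.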
Big data in the cloud is a complex, tricky thing to manage as a unified business resource. It demands unified governance and security. The more complex and heterogeneous your big data cloud, the harder it is to enforce tight control.
You can govern petabytes of data in a coherent manner. There is no inherent trade-off between the volume of the data set and the quality of the data maintained within. The source of data quality problems in most organizations is usually the source transactional systems—whether those are your customer relationship management system, general ledger application, or something else. These systems are usually in the terabytes range.
Just as important is governance of MapReduce and other big data analytic models that execute in your cloud. Big data applications ride on a never-ending stream of new statistical, predictive, segmentation, behavioral, and other advanced analytic models. As you ramp up your data scientist teams and give them more powerful modeling tools, you will soon be swamped with models. Big data analytics demands governance—and, let’s face it, some level of repeatable bureaucracy—if it’s designed to produce artifacts that will be deployed into production applications.
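One way to picture that "repeatable bureaucracy" is a small model registry that tracks every model version and lets only approved versions reach production. The Python sketch below is an assumption-laden illustration, not a description of any IBM tool: the `ModelRegistry` class, the three-stage draft, reviewed, approved workflow, and the rule that owners cannot review their own models are all hypothetical choices made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Tuple


class Status(Enum):
    DRAFT = "draft"
    REVIEWED = "reviewed"
    APPROVED = "approved"


# Allowed promotions (assumed workflow): draft -> reviewed -> approved.
_NEXT = {Status.DRAFT: Status.REVIEWED, Status.REVIEWED: Status.APPROVED}


@dataclass
class ModelRecord:
    name: str
    version: int
    owner: str
    status: Status = Status.DRAFT
    history: List[Tuple[str, str]] = field(default_factory=list)  # (reviewer, new status)


class ModelRegistry:
    """Hypothetical registry that gates which model versions may be deployed."""

    def __init__(self) -> None:
        self._records: Dict[Tuple[str, int], ModelRecord] = {}

    def register(self, name: str, owner: str) -> ModelRecord:
        # Each registration of the same model name gets the next version number.
        version = 1 + max((v for (n, v) in self._records if n == name), default=0)
        record = ModelRecord(name, version, owner)
        self._records[(name, version)] = record
        return record

    def promote(self, name: str, version: int, reviewer: str) -> ModelRecord:
        record = self._records[(name, version)]
        if record.status not in _NEXT:
            raise ValueError("model version is already approved")
        if reviewer == record.owner:
            raise PermissionError("owners cannot review their own models")
        record.status = _NEXT[record.status]
        record.history.append((reviewer, record.status.value))  # audit trail
        return record

    def deployable(self) -> List[Tuple[str, int]]:
        # Only approved versions are eligible for production.
        return sorted(k for k, r in self._records.items()
                      if r.status is Status.APPROVED)


registry = ModelRegistry()
churn = registry.register("churn_score", owner="alice")
registry.register("churn_score", owner="alice")       # version 2 stays a draft
registry.promote("churn_score", 1, reviewer="bob")     # draft -> reviewed
registry.promote("churn_score", 1, reviewer="carol")   # reviewed -> approved
print(registry.deployable())  # [('churn_score', 1)]
```

The point of the sketch is the shape of the control, not the specific rules: every model has an owner, a version, a status, and an audit trail, so a growing pile of models stays inspectable.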
If you’re already doing any of this, the strategic question on cloud-based big data is not where you start. As cloud-based big data services mature and continue to improve in price/performance, scalability, agility, and manageability, the real question will be where you stop. By the end of this decade, as an increasing number of applications and data move to the public cloud, the idea of building and running your own big data deployment may seem as impractical as designing your own servers does today.
If your organization has started using big data in the cloud, what have you learned along the way? And if you haven’t yet begun a cloud project, what questions do you have about the process? Let me know in the comments.