OK, it has taken me a while—but as promised, we are going to revisit Fit for Purpose architectures today. It has been encouraging to see how quickly customers have picked up the Fit for Purpose concept and how many are now thinking in terms of matching the compute problem to the best underlying compute paradigm.
If you aren’t using this idea yet in your organization, I’d strongly encourage you to try it as a way of explaining how things will be different going forward. Engaging with your peers using Fit for Purpose as a model can be a powerful way to stimulate new ways of thinking and map out strategies for engaging with new technologies. However, some of the conversations I had at our annual Information On Demand (IOD) conference a few weeks ago made it clear that we still have work to do in this area.
A number of customers at IOD asked how they would recognize a big data use case that demands or invokes a Fit for Purpose framework. That made me realize it would probably be a good idea to pass along some pragmatic guidance on approaching this for the first time and starting to think about what fits where. Not surprisingly, there wasn’t much doubt about where existing relational technologies fit, or even what to do with appliances like the IBM® PureData™ System for Analytics (powered by IBM Netezza®). Instead, the questions were all about big data and when to reach for a Fit for Purpose approach. More specifically, people were trying to use three attributes to determine whether a Fit for Purpose approach was needed: data size, data type, and job method.
How much data do you need before something becomes a big data problem? It is tempting to think about the size of a data collection being the key (or sole) deciding factor for whether you have a big data project on your hands, but there is no magic terabyte number at which something becomes a big data use case. Sure, 1 PB feels like big data—but most customer scenarios involve hundreds of terabytes. So is it the 101st terabyte that makes it a big data use case? The 501st terabyte?
Your organization may have some quantifiable definition of how much data volume qualifies as big data, but I would argue that you shouldn’t rely on one. The reason to avoid this sort of thinking is that data volume by itself is not very useful in determining whether you need a Fit for Purpose solution. By now, I’m sure you’ve heard me use the example of an email analytics project we did where the data volume was small but the computing power required for accurate analysis (remember, you are competing on accuracy here, not just speed) was huge. If you looked solely at data size, you would have missed the boat. Restricting yourself to size-based definitions of big data also shuts out other, more useful ways of asking the question.
I’ll be blunt: I think using data type as a deciding factor misses the point. To be fair, there are some archetypal data types that are closely associated with big data, log files above all. So why shouldn’t data type be the deciding consideration?
There are a few reasons. Big data technologies don’t care much about data type or format. While log files are associated with big data, they are far from the only data type these technologies can handle. And what really gets me going on this issue are vendors that have a simplistic model where all unstructured data goes into Hadoop and all structured data goes into the warehouse. (Hint: If vendors tell you this, they’re just trying to sell you more Exadata.)
Data type by itself is not an adequate determinant for choosing a particular architecture, primarily because it fails to account for the fact that organizations typically work with the same data in multiple ways. Data can, and should, be used both in real time and as historical reference. Handling structured data in IBM BigInsights™ is just as viable as handling it in a traditional enterprise data warehouse. Similarly, you can get a different insight from looking at data as it flies by in IBM InfoSphere™ Streams than from looking at it as part of an IBM PureSystems™ (powered by Netezza) analytics datamart. Instead of letting data type drive the decision about where to process and store the data, focus on what you have to do with the data at a given time. Let that determine the architecture.
I actually like job method as a selection factor, and I think it is an underutilized way of determining which combination of architectures you should use for a given problem. In cases where you need to do something that’s awkward or impossible to do with SQL, using a programming model instead of a query model can make all the difference. (That doesn’t mean you won’t use a SQL construct at some point, however.) Similarly, you may have an extreme time constraint or other processing consideration that drives the choice to use a specific architecture. Once that timing issue is resolved, you can consider more conventional approaches.
The basic idea here is to use the flexibility that a Fit for Purpose approach provides to bend the method around the data, rather than trying to beat the job into shape using a fixed (SQL-only) approach. If this “bendy” approach helps you solve the problem, then you should be thinking about a Fit for Purpose framework. Job method isn’t the only thing to look at, but it is an important factor—and it is a situational consideration that lends itself to thinking about a given task rather than a hard decision.
So, using only data size or data type isn’t going to get you where you need to be. Instead, try figuring out how the data “wants” to be worked with by looking at the job method as well as the need for flexibility and scale-out options.
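To make the idea concrete, here is a minimal sketch of what a job-method-first triage might look like. Everything in it is hypothetical and for illustration only: the `Workload` fields, the `suggest_approaches` function, and the three category labels are my own shorthand, not part of any IBM product or formal methodology. Notice that data volume is recorded but deliberately never consulted, and data type isn’t recorded at all.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical labels for broad processing paradigms (illustrative only).
STREAMING = "stream processing"
PROGRAMMING_MODEL = "programming-model platform"
WAREHOUSE = "conventional SQL warehouse"


@dataclass
class Workload:
    """A rough description of a job; all fields are assumptions for this sketch."""
    name: str
    volume_tb: float         # recorded, but not used to make the decision
    awkward_in_sql: bool     # the job is hard or impossible to express as a query
    needs_low_latency: bool  # the data must be analyzed as it arrives
    compute_heavy: bool      # analysis cost dominates, whatever the volume


def suggest_approaches(w: Workload) -> List[str]:
    """Suggest paradigms based on what the job has to do, not on size or type."""
    suggestions: List[str] = []
    if w.needs_low_latency:
        suggestions.append(STREAMING)
    if w.awkward_in_sql or w.compute_heavy:
        suggestions.append(PROGRAMMING_MODEL)
    if not suggestions:
        # Nothing unusual about the job method: a conventional approach fits.
        suggestions.append(WAREHOUSE)
    return suggestions


# A small-volume but compute-heavy job, in the spirit of the email analytics example.
email_job = Workload("email analytics", 0.5, False, False, True)
print(suggest_approaches(email_job))
```

A size-based rule would have waved the email job straight through to the warehouse. Because the function can return more than one paradigm, it also reflects the point that the same data is often worked with in several ways at once.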
Do you agree with my thoughts on how to decide whether you need a Fit for Purpose architecture? In your opinion, which selection factors are best? Let me know what you think in the comments.