
Getting Started With Fit for Purpose Architectures

How do you know when you need a new approach?

OK, it has taken me a while—but as promised, we are going to revisit Fit for Purpose architectures today. It has been encouraging to see how quickly customers have picked up the Fit for Purpose concept and how many are now thinking in terms of matching the compute problem to the best underlying compute paradigm.

If you aren’t using this idea yet in your organization, I’d strongly encourage you to try it as a way of explaining how things will be different going forward. Engaging with your peers using Fit for Purpose as a model can be a powerful way to stimulate new ways of thinking and map out strategies for engaging with new technologies. However, some of the conversations I had at our annual Information On Demand (IOD) conference a few weeks ago made it clear that we still have work to do in this area.

A number of customers at IOD asked how they would recognize a big data use case that demands, or at least invites, a Fit for Purpose framework. That made me realize it would probably be a good idea to pass along some pragmatic guidance on approaching this for the first time and starting to think about what fits where. Not surprisingly, there wasn’t much doubt about where existing relational technologies fit, or even about what to do with appliances like the IBM® PureData™ System for Analytics (powered by IBM Netezza®). Instead, the questions were all about big data and when a Fit for Purpose approach becomes necessary. More specifically, people were trying to use three data attributes to determine whether a Fit for Purpose approach was needed:

 

1. Data size

How much data do you need before something becomes a big data problem? It is tempting to treat the size of a data collection as the key (or sole) deciding factor in whether you have a big data project on your hands, but there is no magic terabyte count at which something becomes a big data use case. Sure, 1 PB feels like big data, but most customer scenarios involve hundreds of terabytes. So is it the 101st terabyte that makes it a big data use case? The 501st terabyte?

Your organization may have some quantifiable definition of how much data volume qualifies as big data, but I would argue that you shouldn’t rely on one. Data volume by itself is not very useful in determining whether you need a Fit for Purpose solution. By now, I’m sure you’ve heard me use the example of an email analytics project we did where the data volume was small but the computing power required for accurate analysis (remember, you are competing on accuracy here, not just speed) was huge. If you had looked solely at data size, you would have missed the boat. Restricting yourself to size-based definitions of big data also rules out other, more useful ways of asking the question.
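To make that concrete, here is a minimal sketch (plain Python over made-up data, not any IBM product API) of why a small corpus can still be a heavy compute problem: scoring every email against every other email grows quadratically with the number of documents, so the analysis can dwarf the storage footprint even when the data would fit on a laptop.

```python
# Illustrative sketch only (hypothetical data, no IBM product APIs):
# a small email corpus can still be compute-bound, because pairwise
# comparison of n documents needs n*(n-1)/2 similarity calculations.
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_similarity(emails):
    """Score every email against every other one.
    10,000 emails are tiny to store, but ~50 million comparisons to compute."""
    token_sets = [set(text.lower().split()) for text in emails]
    return {
        (i, j): jaccard(token_sets[i], token_sets[j])
        for i, j in combinations(range(len(token_sets)), 2)
    }


if __name__ == "__main__":
    sample = [
        "quarterly results attached",
        "see attached quarterly numbers",
        "lunch on friday?",
    ]
    print(pairwise_similarity(sample))
```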

 

2. Data type

I’ll be blunt: I think using data type as a deciding factor misses the point. To be fair, there are some archetypal data types that are closely associated with big data—namely log files. So why shouldn’t data type be the only consideration?

There are a few reasons. Big data technologies don’t care much about data type or format; while log files are associated with big data, they are far from the only data type these technologies can handle. And what really gets me going on this issue is vendors pushing a simplistic model in which all unstructured data goes into Hadoop and all structured data goes into the warehouse. (Hint: If vendors tell you this, they’re just trying to sell you more Exadata.)

Data type by itself is not an adequate determinant for choosing a particular architecture, primarily because it ignores the fact that organizations typically work with the same data in multiple ways. Data can, and should, be used both in real time and as historical reference. Structured data can be handled just as well in a traditional enterprise data warehouse as in IBM BigInsights™, and you can get yet another insight from looking at the same data as it flies by in IBM InfoSphere™ Streams or as part of an IBM PureData™ System for Analytics (powered by Netezza) datamart. Instead of letting data type drive the decision about where to process and store the data, focus on what you have to do with the data at a given time, and let that determine the architecture.
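As a simple illustration (plain Python with a hypothetical transaction record shape, standing in for, not reproducing, the Streams and warehouse programming models), the same structured records can feed both a real-time view and a historical view; the engine you choose should follow from which of those jobs you are doing at that moment.

```python
# Illustrative sketch only: the same structured records consumed two ways.
# The record shape {"account": ..., "amount": ..., "ts": ...} is hypothetical,
# and neither function is an actual Streams or warehouse API.
from collections import defaultdict


def flag_large_in_flight(events, threshold=10_000.0):
    """Real-time view: inspect each record as it flies by and emit alerts."""
    for event in events:
        if event["amount"] >= threshold:
            yield event  # in practice this would feed a streaming alert pipeline


def totals_by_account(history):
    """Historical view: the same records, rolled up for reporting --
    the kind of aggregate a warehouse or analytics datamart would serve."""
    totals = defaultdict(float)
    for event in history:
        totals[event["account"]] += event["amount"]
    return dict(totals)


if __name__ == "__main__":
    records = [
        {"account": "A1", "amount": 250.0, "ts": 1},
        {"account": "A1", "amount": 12_500.0, "ts": 2},
        {"account": "B7", "amount": 980.0, "ts": 3},
    ]
    print(list(flag_large_in_flight(iter(records))))  # streaming-style pass
    print(totals_by_account(records))                 # batch-style roll-up
```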

 

3. Job method

I actually like this selection factor, and I think it is an underutilized way of determining which combination of architectures you should use for a given problem. In cases where you need to do something that’s awkward or impossible to do with SQL, using a programming model instead of a query model can make all the difference. (That doesn’t mean you won’t use a SQL construct at some point, however.) Similarly, you may have an extreme time constraint or other processing consideration that drives the choice of a specific architecture. Once that constraint is addressed, you can consider more conventional approaches.
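To illustrate, here is a minimal sketch (plain Python, with an assumed clickstream record shape and a 30-minute inactivity rule) of sessionization, a job that is awkward to express as a single SQL query but falls out naturally from a procedural or MapReduce-style programming model.

```python
# Illustrative sketch only: sessionizing clickstream events, which is awkward
# to express as a single SQL query but straightforward in a programming model.
# The record shape and the 30-minute inactivity rule are assumptions.
from itertools import groupby
from operator import itemgetter

SESSION_GAP_SECONDS = 30 * 60  # inactivity gap that closes a session


def sessionize(events):
    """Group events per user into sessions separated by SESSION_GAP_SECONDS.
    events: iterable of {"user": str, "ts": float, "url": str} dicts."""
    sessions = []
    ordered = sorted(events, key=itemgetter("user", "ts"))
    for _, user_events in groupby(ordered, key=itemgetter("user")):
        current = []
        for event in user_events:
            if current and event["ts"] - current[-1]["ts"] > SESSION_GAP_SECONDS:
                sessions.append(current)
                current = []
            current.append(event)
        if current:
            sessions.append(current)
    return sessions


if __name__ == "__main__":
    clicks = [
        {"user": "u1", "ts": 0, "url": "/home"},
        {"user": "u1", "ts": 120, "url": "/search"},
        {"user": "u1", "ts": 4000, "url": "/home"},   # > 30 min later: new session
        {"user": "u2", "ts": 60, "url": "/pricing"},
    ]
    print(sessionize(clicks))
```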

The basic idea here is to use the flexibility that a Fit for Purpose approach provides to bend the method around the data, rather than trying to beat the job into shape with a fixed (SQL-only) approach. If this “bendy” approach helps you solve the problem, then you should be thinking about a Fit for Purpose framework. Job method isn’t the only thing to look at, but it is an important one, and because it is situational it lends itself to reasoning about a given task rather than making a single hard decision up front.

So, using only data size or data type isn’t going to get you where you need to be. Instead, try figuring out how the data “wants” to be worked with by looking at the job method as well as the need for flexibility and scale-out options.

Do you agree with my thoughts on how to decide whether you need a Fit for Purpose architecture? In your opinion, which selection factors are best? Let me know what you think in the comments.
 

 

Tom Deutsch

Tom Deutsch (Twitter: @thomasdeutsch) is chief technology officer (CTO) for the IBM Industry Solutions Group and focuses on data science as a service. Tom played a formative role in the transition of Apache Hadoop–based technology from IBM Research to the IBM Software Group, and he continues to be involved with IBM Research's big data activities and the transition from research to commercial products. In addition, he created the IBM® InfoSphere® BigInsights™ Hadoop–based software, and he has spent several years helping customers with Hadoop, InfoSphere BigInsights, and InfoSphere Streams technologies by identifying architecture fit, developing business strategies, and managing early-stage projects across more than 200 engagements. Tom came to IBM through the FileNet acquisition, where he had responsibility for FileNet’s flagship content management product and spearheaded FileNet product initiatives with other IBM software segments, including the Lotus and InfoSphere segments. Tom has also worked in the Information Management CTO’s office and with a team focused on emerging technology, helping customers adopt innovative IBM enterprise mash-ups and cloud-based offerings. With more than 20 years of experience in the industry, and as a veteran of two startups, Tom is an expert on the technical, strategic, and business information management issues facing the enterprise today. Most of his work has been on emerging technologies and business challenges, and he brings a strong focus on the cross-functional work required for early-stage projects to succeed. Tom has coauthored a book on big data and multiple thought-leadership papers. He earned a bachelor’s degree from Fordham University in New York and an MBA degree from the University of Maryland University College.

  • Risto Saplamaev

    A very useful article for companies that need to implement these technologies so that their data actually works for them.

  • Vikram

    I agree with most of the points. Big data is not a single tool; it is a collection of tools that tackle problems that are difficult to handle with so-called traditional tools.

    And it is wrong to see only the volume of data as the problem. As you rightly said, a small amount of data may be difficult to process with traditional tools because of the nature of the problem.