Big Data and Warehousing

Big Data: Fundamentally Different, Not Just Bigger

Why traditional methods of analysis just aren’t sufficient for working with big data

For big data bloggers, LinkedIn is a great venue to reflect on relevant topics—but you have to take what you find there with a grain of salt. I don’t want to be overly harsh, but many people who speak with authority on big data topics have never spent any actual time in the space. Most of this writing is well-intentioned, of course, but some of these bloggers write because they feel threatened by the changes that are underway. The common thread of these posts is the idea that nothing about big data is really new—that it doesn’t require fundamental changes in the way people, teams, and data architectures work together. And that simply isn’t correct.

A good example of this came up in a post from a data analytics professor who published a link to a predictive modeling blueprint for big data—basically a “how to” guide. The blueprint purported to provide a road map for operating a big data predictive analytics capability within an organization. Cool, I thought. I’m game for seeing what others think about how to do that. I clicked through eagerly, hoping I could squirrel away some new insight.

Now, I assume that the professor posted this to help ignite the conversation around big data, which of course is a good thing. But the blueprint was fundamentally broken and the recommendations in the paper were completely off base—and not by just a little.

Basically, the road map suggested that nothing is different about using big data for predictive analytics versus what you currently do with small, traditional data sets. If you removed the words “big data” from the paper, it would look much like the way we’ve done predictive analytics with traditional data for the last 5 or 10 years. It was linear, tool-centric, and silo-oriented, and it explicitly said that SAS and R are the only tools needed. The architecture and road map even referred to a single data source and a separate analytics server capable of handling every query!

Respectfully, the model presented wasn’t even close to being correct or useful. It assumed that the usual setup of stand-alone, linear, analyst-driven work, separate from the data collection process, would be sufficient, which isn’t true. Big data analytics is anything but linear, especially when you are dealing with new or under-instrumented data sources. The sheer size of your data and the number of sources it comes from force you to deliberately break existing methods.

The model from the blog also completely omitted the role of data curation and any notion of latency, and it made no mention of data accuracy and lineage issues. Furthermore, it assumed that existing tools are adequate for the job at hand. In the real world, none of those assumptions is wholly accurate—in fact, big data challenges every single one. I’m not a fan of the hype about data scientists being gods, but there is truth to the idea that a new skill set is required to work with big data. Good data scientists do exactly what is missing from the blueprint presented: they mix data sources, push to use larger data sets, and challenge the accepted wisdom that you should wait for results rather than change outcomes.

The road map that was given is a direct course to failure, in my opinion, because it was rooted in defense of the status quo. This is one of those times of great change when you need to leave yourself open to wherever the journey takes you. That doesn’t mean you should abandon everything you know and trust about working with data—but if you are going to get challenged anyway, this is a perfect opportunity to take a fresh look at your assumptions about how things are supposed to be versus how they are. (In my experience, walking into a headwind is better than waiting for it to blow you over.) If you do that and decide everything is fine as is, then great—at least you’ve looked. Just remember that fortune favors the brave.

Now, I’ll be the first to volunteer that I’ve still got a lot to learn about how to work with big data, just like everyone else. But one thing I do know for sure is that the traditional ways of working with data will not lead to success in big data analytics. The variety of information sources, the volume of information, the latency of processing, even the basic business models are often all different in the big data space. Anyone who recommends using the same old tools and linear approaches under those circumstances either hasn’t spent time in the space or isn’t humble enough to listen to those who have.

What do you think? Do you agree that big data requires fundamentally different ways of working? Let me know in the comments.
 

 

Tom Deutsch

Tom Deutsch (Twitter: @thomasdeutsch) serves as a Program Director on IBM’s Big Data team. He played a formative role in the transition of Hadoop-based technology from IBM Research to IBM Software Group, and he continues to be involved in IBM Research big data activities and the transition of research work into commercial products. Tom created the IBM BigInsights Hadoop-based product and has since spent several years helping customers with Apache Hadoop, BigInsights, and Streams technologies: identifying architecture fit, developing business strategies, and managing early-stage projects across more than 200 customer engagements. Tom has co-authored a Big Data book and multiple thought papers.

Prior to that, Tom worked in Information Management in the CTO’s office, where he was part of a team focused on emerging technology and helped customers adopt IBM’s innovative Enterprise Mashups and Cloud offerings. Tom came to IBM through the FileNet acquisition, where he had responsibility for FileNet’s flagship Content Management product and spearheaded FileNet product initiatives with other IBM software segments, including Lotus and InfoSphere.

With more than 20 years in the industry, and as a veteran of two startups, Deutsch is an expert on the technical, strategic, and business information management issues facing the enterprise today. Most of Tom’s work has been on emerging technologies and business challenges, and he brings a strong focus on the cross-functional work required to make early-stage projects succeed.

Deutsch earned a bachelor’s degree from Fordham University in New York and an MBA degree from the University of Maryland University College.

  • Victor Smart (http://www.cimaglobal.com)

    This is intriguing for management accountants. Are they and their tools suited for Big Data? And what does Tom mean when he says the business models may be different? I am currently writing a book on Big Data and the finance role for Wiley. Interested to hear your views.

    • Tom Deutsch

      Hi Victor, thanks for the note.

      To be fair, I haven’t personally seen an accounting-oriented use case. Most of the compute needs there are pretty well served by traditional ERP-oriented systems, and most of my time is spent on exploratory or new-frontier work. That said, we are seeing a lot of risk-related work where sheer speed and flexibility of compute is the driver, so there is probably an adjacent space here.

      What do you think?

  • Mike Vostrikov

    Thank you for the interesting article, Tom. Maybe when we speak about big data we are used to thinking of huge amounts of heterogeneous, unstructured data, where analytics takes place in chains of MapReduce jobs. That approach differs significantly from the traditional one and requires a bunch of new skills and tools. However, in my opinion, big data doesn’t always mean fundamentally different ways of working. For example, we can use Hadoop+Hive just for storing huge amounts of data and extracting small data sets that we can then process in the usual ways, using the usual analytic tools. In such a case there is nothing fundamentally new except a large, self-healing data-storage cluster.

    • Tom Deutsch

      Hi Mike, thanks for the note. I agree with your point that not all use cases result in a massive data set for end users, but in the example you gave, how you end up with that small, familiar data set is indeed different. I’d also cooperatively suggest that if that is “all” you use your Hadoop cluster for, we probably all need to get more creative about other unmet needs.
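
      To make that concrete, here is a minimal sketch of the extract-then-analyze pattern you describe, with Hive doing the heavy reduction so that the familiar tools only ever see the small result. It is just an illustration, not a prescription: the table, columns, and gateway host are hypothetical, and it assumes a PyHive-style connection to the cluster.

          # Sketch only: the "clickstream" table, its columns, and the
          # host below are hypothetical stand-ins.
          import pandas as pd
          from pyhive import hive

          # Connect to a (hypothetical) HiveServer2 gateway on the cluster.
          conn = hive.Connection(host="hive-gateway.example.com", port=10000)

          # Hive scans the billions of raw rows; only the small aggregate
          # ever comes back to the analyst's machine.
          query = """
              SELECT page_id,
                     COUNT(*)            AS views,
                     COUNT(DISTINCT uid) AS visitors
              FROM   clickstream
              WHERE  dt >= '2013-01-01'
              GROUP  BY page_id
          """
          small_df = pd.read_sql(query, conn)

          # From here on, the "usual" tools and methods apply unchanged.
          print(small_df.describe())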

      Thanks for the note, and I appreciate you making the valid point.