For big data bloggers, LinkedIn is a great venue to reflect on relevant topics, but you have to take what you find there with a grain of salt. I don’t want to be overly harsh, but many people who speak with authority on big data topics have never spent any actual time in the space. Most of this commentary is well-intentioned, of course, but some of these bloggers write because they feel threatened by the changes that are underway. The common thread of these posts is the idea that nothing about big data is really new, and that it doesn’t require fundamental changes in the ways people, teams, and data architectures have to work together. And that simply isn’t correct.
A good example of this came up in a post from a data analytics professor who published a link to a predictive modeling blueprint for big data—basically a “how to” guide. The blueprint purported to provide a road map for operating a big data predictive analytics capability within an organization. Cool, I thought. I’m game for seeing what others think about how to do that. I clicked through eagerly, hoping I could squirrel away some new insight.
Now, I assume that the professor posted this to help ignite the conversation around big data, which of course is a good thing. But the blueprint was fundamentally broken and the recommendations in the paper were completely off base—and not by just a little.
Basically, the road map suggested that nothing is different about using big data for predictive analytics versus what you currently do with small, traditional data sets. If you removed the words “big data” from the paper, it would look similar to the way we’ve done predictive analytics with traditional data for the last 5 or 10 years. It was linear, tool-centric, and silo-oriented, and it explicitly stated that SAS and R are the only tools needed. The architecture and road map even referred to a single data source and a separate analytics server that was capable of handling every query!
Respectfully, the model presented wasn’t even close to being correct or useful. It assumed that the usual setup of stand-alone, linear, analyst-driven work, kept separate from the data collection process, would be sufficient, which isn’t true. Big data analytics is anything but linear, especially since you are dealing with new or under-instrumented data sources. You must deliberately break existing methods because of the size of your data and its numerous sources.
The model from the blog also completely omitted the role of data curation and notions of latency, and it made no mention of data accuracy and lineage issues. Furthermore, it assumed that existing tools are adequate for the job at hand. In the real world, none of those assumptions is wholly accurate—in fact, big data challenges every single one. I’m not a fan of the hype about data scientists being gods, but there is truth to the idea that a new skill is required to work with big data. Good data scientists do exactly what is missing from the blueprint presented: they mix data sources, push to use larger data sets, and challenge the accepted wisdom that you should wait for results rather than change outcomes.
The road map that was given is a direct course to failure, in my opinion, because it was rooted in defense of the status quo. This is one of those times of great change when you need to leave yourself open to wherever the journey takes you. That doesn’t mean you should abandon everything you know and trust about working with data—but if you are going to get challenged anyway, this is a perfect opportunity to take a fresh look at your assumptions about how things are supposed to be versus how they are. (In my experience, walking into a headwind is better than waiting for it to blow you over.) If you do that and decide everything is fine as is, then great—at least you’ve looked. Just remember that fortune favors the brave.
Now, I’ll be the first to volunteer that I’ve still got a lot to learn about how to work with big data, just like everyone else. But one thing I do know for sure is that the traditional ways of working with data will not lead to success in big data analytics. The variety of information sources, the volume of information, the latency of processing, and even the basic business models are often all different in the big data space. Anyone who recommends using the same old tools and linear approaches under those circumstances is someone who hasn’t spent time in the space, or wasn’t humble enough to listen to those who have.
What do you think? Do you agree that big data requires fundamentally different ways of working? Let me know in the comments.