Big Data: Data Quality’s Best Friend? Part 1

Why we don't see an inherent trade-off between the volume of a data set and the quality of the data maintained within it
James Kobielus coauthored this article.
Special thanks to John McPherson, who contributed to this article.

Many people are under the mistaken impression that there’s an inherent trade-off between the volume of a data set and the quality of the data maintained within it. This concern comes up frequently—and it was a big topic recently on the panel that Tom participated in at the Financial Services Information Sharing and Analysis Center (FS-ISAC), among other places.

According to this way of thinking, you can’t scale out into the petabyte range without filling your Apache Hadoop cluster, massively parallel data warehouse, and other nodes with junk data that is inconsistent, inaccurate, redundant, out of date, or unconformed. But we disagree. Here’s why we think this notion is an oversimplification of what’s really going on.

Big data isn’t the transactional source of most data problems

In most organizations, data quality problems usually originate in the source transactional systems, whether that’s your customer relationship management (CRM) system, general ledger application, or something else. These systems are usually in the terabyte range.

During the discussion, Jim rightly noted that any IT administrator who fails to keep the system of record cleansed, current, and consistent has lost half the battle. Sure, you can fix the issue downstream (to some degree) by aggregating, matching, merging, and cleansing data in intermediary staging databases. But the quality problem has everything to do with inadequate controls at the data’s transactional source, and very little to do with the sheer volume of it.
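To make the downstream fix concrete, here is a minimal Python sketch of matching and merging duplicate customer records in a staging area. It is only an illustration: the field names (`email`, `name`, `phone`, `updated`), the match-on-normalized-email rule, and the "newest non-empty value wins" survivorship rule are all assumptions of ours, not how any particular product does it.

```python
from collections import defaultdict


def match_key(record):
    """Build a match key from a trimmed, lowercased email (assumed matching rule)."""
    return record["email"].strip().lower()


def merge_duplicates(records):
    """Group records by match key and keep the newest non-empty value per field."""
    groups = defaultdict(list)
    for rec in records:
        groups[match_key(rec)].append(rec)

    merged = []
    for key, group in groups.items():
        # Oldest first, so newer non-empty values overwrite older ones.
        # ISO-8601 date strings sort correctly as plain strings.
        group.sort(key=lambda r: r["updated"])
        survivor = {}
        for rec in group:
            for field, value in rec.items():
                if value not in (None, ""):
                    survivor[field] = value
        survivor["email"] = key  # store the normalized form
        merged.append(survivor)
    return merged


# Hypothetical staged rows: the first two are the same customer entered twice.
staged = [
    {"email": "Ann@Example.com ", "name": "Ann Lee", "phone": "", "updated": "2012-01-05"},
    {"email": "ann@example.com", "name": "Ann Lee", "phone": "555-0100", "updated": "2012-03-01"},
    {"email": "bob@example.com", "name": "Bob Roy", "phone": "555-0199", "updated": "2012-02-11"},
]
clean = merge_duplicates(staged)
```

Note that nothing in this logic depends on how many rows there are. The same rule applies to three records or three billion, which is why the work parallelizes well, and why the real fix still belongs at the point of entry.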

Downstream from the source of the problem, you can scale your data cleansing operations with a massively parallel deployment of IBM® InfoSphere® QualityStage®—or of IBM BigInsights™ tricked out for this function—but don’t blame the cure for an illness it didn’t cause.

Big data is aggregating new data sources that you haven’t historically needed to cleanse

In traditional warehouse systems the issue of data quality is fairly well understood (if still a challenge) when you are primarily concerned with maintaining the core systems of record—customers, finances, human resources, the supply chain, and so on. But how about in the big data space?

A lot of big data initiatives are for deep analysis of aggregated data sources such as social marketing intelligence, real-time sensor data feeds, data pulled from external resources, browser clickstream sessions, IT system logs, and the like. These sources have historically not been linked to official reference data from transactional systems. Historically, you did not have to cleanse them because they were looked at in isolation by specialist teams that often worked through issues offline and weren’t feeding their results into an official system of record. However, analytics that cut across information types, which are common in the big data space, have changed this dynamic.
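One way to see why linking changes things is to check how many records in a new source actually resolve against your reference data. The Python sketch below assumes a hypothetical `customer_id` field on each event and a set of known IDs taken from a system of record; both are illustrative only.

```python
def split_by_reference(events, reference_ids, key="customer_id"):
    """Separate events whose key resolves against reference data from those that do not."""
    matched, orphans = [], []
    for event in events:
        if event.get(key) in reference_ids:
            matched.append(event)
        else:
            orphans.append(event)
    return matched, orphans


# Hypothetical reference data and clickstream events.
reference_ids = {"C001", "C002"}
events = [
    {"customer_id": "C001", "action": "view"},
    {"customer_id": "c002", "action": "click"},  # case mismatch, will not resolve
    {"customer_id": "C009", "action": "view"},   # unknown to the system of record
]

matched, orphans = split_by_reference(events, reference_ids)
match_rate = len(matched) / len(events)
```

A low match rate on a source that was never meant to be joined is not surprising. It simply tells you how much conforming work stands between that source and the analysis you want to run.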

Although individual data points can be of marginal value in isolation, they can be quite useful when pieced into a larger puzzle. They help provide context for what happened, or what is happening.

Unlike business reference data, these new sources do not provide the sort of data that you would load directly into your enterprise data warehouse, archive offline, or need to keep for e-discovery purposes. Rather, you drill into the data to distill key patterns, trends, and root causes, and you would probably purge most of it once it has served its core tactical purpose. This generally takes a fair amount of mining, slicing, and dicing.

Data quality matters in two ways in this situation. First, you can’t lose the source, actors, participants, or actions—and these items need to be defined consistently with the rest of your data. Second, you can’t lose the lineage of how you boiled things down. The who, what, when, where, and how need to be discoverable and reproducible.
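The lineage point can be sketched in a few lines of Python. The idea here, which is our own illustrative assumption rather than a prescribed method, is to wrap each boiling-down step so that it logs its name, row counts, and a content hash of its input and output. The function and field names are hypothetical.

```python
import hashlib
import json


def fingerprint(rows):
    """Return a stable SHA-256 digest of a list of JSON-serializable rows."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def run_step(name, func, rows, lineage):
    """Apply one transformation and append a lineage entry describing it."""
    result = func(rows)
    lineage.append({
        "step": name,
        "rows_in": len(rows),
        "rows_out": len(result),
        "input_hash": fingerprint(rows),
        "output_hash": fingerprint(result),
    })
    return result


def drop_anonymous(rows):
    """Keep only events that carry a user identifier."""
    return [r for r in rows if r["user"]]


def pages_per_user(rows):
    """Count page views per user, in sorted user order."""
    users = sorted({r["user"] for r in rows})
    return [{"user": u, "pages": sum(1 for r in rows if r["user"] == u)} for u in users]


# Hypothetical clickstream sample.
clicks = [
    {"user": "u1", "page": "/home", "ms": 120},
    {"user": "u1", "page": "/cart", "ms": 95},
    {"user": None, "page": "/home", "ms": 80},
]

lineage = []
known = run_step("drop_anonymous", drop_anonymous, clicks, lineage)
summary = run_step("pages_per_user", pages_per_user, known, lineage)
```

Because each step's input hash equals the previous step's output hash, the log forms a chain you can replay and verify even after the raw data has been purged.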

As our IBM Research colleague John McPherson says, “Keep in mind that often when we’re talking about big data we are talking about using data that we haven’t been able to exploit well in the past—so we’re typically trying to solve different problems. We’re not trying to figure out the profitability of each of our stores. We should already be doing that using high-quality data from systems of record and doing the things we do to standardize and reshape as we put it into a data warehouse.” What we’re trying to do here, in John’s example, is find out what’s contributing to the profitability for the stores.

This article is continued in Part 2. In the meantime, please let us know about your experiences with big data quality in the comments.


Tom Deutsch

Tom Deutsch (Twitter: @thomasdeutsch) is chief technology officer (CTO) for the IBM Industry Solutions Group, and focuses on data science as a service. Tom played a formative role in the transition of Apache Hadoop–based technology from IBM Research to the IBM Software Group, and he continues to be involved with IBM Research's big data activities and the transition from research to commercial products. In addition, he created the IBM® InfoSphere® BigInsights™ Hadoop–based software, and he has spent several years helping customers with Hadoop, InfoSphere BigInsights, and InfoSphere Streams technologies by identifying architecture fit, developing business strategies, and managing early stage projects across more than 200 engagements. Tom came to IBM through the FileNet acquisition, where he had responsibility for FileNet’s flagship content management product and spearheaded FileNet product initiatives with other IBM software segments, including the Lotus and InfoSphere segments. Tom has also worked in Information Management in the CTO’s office and with a team focused on emerging technology. He helped customers adopt innovative IBM enterprise mash-ups and cloud-based offerings. With more than 20 years of experience in the industry, and as a veteran of two startups, Tom is an expert on the technical, strategic, and business information management issues facing the enterprise today. Most of his work has been on emerging technologies and business challenges, and he brings a strong focus on the cross-functional work required to have early stage projects succeed. Tom has coauthored a book on big data and multiple thought-leadership papers. He earned a bachelor’s degree from Fordham University in New York and an MBA degree from the University of Maryland University College.

  • Diane

    I could not agree more!! It is better to have a couple of gigs of useful data than TBs of garbage. There are ways to get large volumes of data cleansed, but the ideal is to stop the garbage at the source.


  • Jim Walker


    This is a well-done, realistic look at the intersection of big data and data quality. Much of the data that we see in our customers’ Hadoop implementations is of this sort. Machine-generated data is much less likely to be of poor quality than data entered by the sales team in the field. Sales people enter duplicate records? ;)

    In the end, I have always claimed that data quality is a sliding scale. Where you land on that scale is driven by two things… a) your stomach (budget) for dealing with the issues and more importantly, b) the business problem you are trying to solve with the data. If I am just trying to find a needle in a haystack, how high does the quality need to be? If I am analyzing click stream data and joining that with my customer data to gain insight into trends and patterns (or profitability of web stores), the required level of quality is completely different.

    It IS about different problems. Great post!