Big Data: Data Quality’s Best Friend? Part 2

Why we don't see an inherent trade-off between the volume of a data set and the quality of the data maintained within it
James Kobielus coauthored this article.
Special thanks to John McPherson, who contributed to this article.
This article follows Part 1.


Big data allows you to find quality problems in the source data that were previously invisible

If your Apache Hadoop cluster is aggregating data sets that have never coexisted in the same database before, and if you’re trying to build a common view across them, you may be in for a rude awakening. It’s not uncommon to find quality issues when you start working with information sources that have historically been underutilized.

With underutilized data, quality issues can turn into a rat’s nest of nasty discoveries, so it pays to expect the unexpected. A couple of years ago, for example, we did a predictive analytics project on complex systems availability and found that the system data provided as a reference was highly variable and not as described in the spec. The “standard” was really more of a “suggestion.” In cases like this, you either need to go back and fix how the core systems generate the data or work around the quality issues. This is fairly common because, almost by definition, an underutilized information source may never have been put to rigorous use before.

This issue rises to a new level of complexity when you’re combining structured data with a fresh tsunami of unstructured sources that, it almost goes without saying, are rarely managed as official systems of record. In fact, when dealing with unstructured information (which is the most important new source of big data), expect the data to be fuzzy, inconsistent, and noisy. A growing range of big data sources provide non-transactional data (event, geospatial, behavioral, clickstream, social, sensor, and so on) that is fuzzy and noisy by its very nature. Establishing an official standard and a shared method for processing this data through a single system is a very good idea.


Big data may have more quality issues simply because the data volume is greater

When you talk about big data, you’re usually talking about more volume, more velocity, and more variety. Of course, that means you’re also likely to see more low-quality data records than in smaller data sets.

But that’s simply a matter of the greater scale of big data sets, rather than a higher incidence of quality problems. While it is true that a 1 percent data fidelity issue is numerically and administratively far worse at 1 billion samples than at 1 million (10 million bad records rather than 10,000), the overall rate remains the same and its impact on the resulting analytics is consistent. Under such circumstances, dealing with the data cleanup may require more effort, but as we noted earlier, that’s exactly the sort of workload scaling where big data platforms excel.
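To make the scale argument concrete, here is a minimal Python sketch of the arithmetic. The function name bad_record_count is ours, chosen for illustration; the 1 percent rate and the two data set sizes come from the paragraph above.

```python
def bad_record_count(total_records: int, error_rate: float) -> int:
    """Expected number of bad records at a given error rate."""
    return round(total_records * error_rate)


ERROR_RATE = 0.01  # the 1 percent fidelity issue discussed in the text

for total in (1_000_000, 1_000_000_000):
    bad = bad_record_count(total, ERROR_RATE)
    # The absolute count grows with volume; the rate does not.
    print(f"{total:>13,} records: {bad:>10,} bad ({bad / total:.0%} of the total)")
```

The cleanup workload grows a thousandfold between the two cases, but the share of bad records feeding the analytics is identical.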

Interestingly, big data is ideally suited to resolve one of the data quality issues that has long bedeviled the statistical analysts of the world: the traditional need to build models on training samples rather than on the entire population of data records. This idea is important but underappreciated. The scalability constraints of analytic data platforms have historically forced modelers to give up granularity in the data set in order to speed up model building, execution, and scoring. Not having the complete data population at your disposal means that you may completely overlook outlier records and, as a result, risk skewing your analysis toward only the records that survived the cut.

This isn’t a data quality problem (the data in the source and in the sample may be perfectly accurate and up to date) as much as a loss of data resolution downstream when you blithely filter out the sparse/outlier records. However, the effect can be the same. Put simply, the noise in the whole data set is less of a risk than distortion or compressed/artificial results from an incorrect or constrained sample. We’re not saying that sampling is a bad thing—but generally, when you have the option of removing the constraints that prevent you from using all the data, you should do it.
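The loss of resolution described above can be illustrated with a small Python sketch. Everything in it is synthetic and assumed for illustration: the population size, the five planted outlier records, the threshold, and the 1 percent sample are our choices, not figures from any real project.

```python
import random


def make_population(n, outlier_every):
    """Synthetic data: values near 100, plus a few rare extreme records."""
    rng = random.Random(42)  # fixed seed so the run is repeatable
    values = [rng.gauss(100.0, 5.0) for _ in range(n)]
    for i in range(0, n, outlier_every):
        values[i] = 10_000.0  # hypothetical outlier record
    return values


def count_outliers(values, threshold=1_000.0):
    """Count records above an (assumed) outlier threshold."""
    return sum(1 for v in values if v > threshold)


population = make_population(n=100_000, outlier_every=20_000)  # 5 outliers
sample = random.Random(7).sample(population, k=1_000)          # 1% sample

print("outliers in full population:", count_outliers(population))
print("outliers in 1% sample:      ", count_outliers(sample))
print("max of full population:", max(population))
print("max of 1% sample:      ", max(sample))
```

With five outliers in 100,000 records, a 1 percent random sample misses all of them roughly 95 percent of the time (about 0.99 to the fifth power), even though every record in the sample is perfectly accurate. A full scan always sees them.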

We’re also not saying all of this is easy. Let’s look at a specific customer example in the messy social listening space. It’s easy to manage noisy or bad data when you are dealing with general discussion about a topic. The volume of activity here usually takes care of outliers, and you are, by definition, listening to customers in aggregate. Data comes from many sources, so you can probably trust (but verify through sensitivity analysis) that missing or bad data won’t cause a misinterpretation of what people mean. However, when you examine what a particular customer is saying and then decide how you should respond to that individual, missing or bad data becomes much more problematic. It may or may not be fatal to that analytics run, but it inherently presents more of a challenge. You need to know the impact of getting it wrong and design accordingly. Look for more on this topic in future columns.
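One simple way to run the kind of sensitivity analysis mentioned above is to discard a random share of records and watch how far the result moves. The Python sketch below does that for an aggregate sentiment score and then contrasts it with a single customer. The scores, the 5 percent drop rate, and the three-post customer are all hypothetical values chosen for illustration.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical sentiment scores in [-1, 1] for posts about a topic.
scores = [max(-1.0, min(1.0, rng.gauss(0.2, 0.4))) for _ in range(50_000)]


def mean_after_dropping(values, drop_fraction, rng):
    """Mean of the values after randomly discarding a fraction as 'missing'."""
    kept = [v for v in values if rng.random() >= drop_fraction]
    return statistics.fmean(kept)


# Aggregate view: repeat the experiment and record the largest shift.
baseline = statistics.fmean(scores)
trials = [mean_after_dropping(scores, 0.05, rng) for _ in range(20)]
max_shift = max(abs(t - baseline) for t in trials)

print(f"baseline mean sentiment: {baseline:.3f}")
print(f"largest shift after dropping 5% of posts: {max_shift:.4f}")

# Individual view: two mild positives and one strong complaint.
customer_posts = [0.6, 0.5, -0.9]
with_all = statistics.fmean(customer_posts)
without_complaint = statistics.fmean(customer_posts[:2])

print(f"customer score with all posts:       {with_all:.2f}")
print(f"customer score if complaint is lost: {without_complaint:.2f}")
```

In the aggregate case the mean barely moves when records go missing, which is why volume tends to absorb bad data. For the individual customer, losing one post changes the picture from roughly neutral to clearly positive, and that could lead to the wrong response.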

Big data can be data quality’s best friend—or at least an innocent bystander on quality issues that originate elsewhere. Do you agree? Let us know in the comments.

Tom Deutsch

Tom Deutsch (Twitter: @thomasdeutsch) is chief technology officer (CTO) for the IBM Industry Solutions Group, and focuses on data science as a service. Tom played a formative role in the transition of Apache Hadoop–based technology from IBM Research to the IBM Software Group, and he continues to be involved with IBM Research's big data activities and the transition from research to commercial products. In addition, he created the IBM® InfoSphere® BigInsights™ Hadoop–based software, and he has spent several years helping customers with Hadoop, InfoSphere BigInsights, and InfoSphere Streams technologies by identifying architecture fit, developing business strategies, and managing early stage projects across more than 200 engagements. Tom came to IBM through the FileNet acquisition, where he had responsibility for FileNet’s flagship content management product and spearheaded FileNet product initiatives with other IBM software segments, including the Lotus and InfoSphere segments. Tom has also worked in Information Management in the CTO’s office and with a team focused on emerging technology. He helped customers adopt innovative IBM enterprise mash-ups and cloud-based offerings. With more than 20 years of experience in the industry, and as a veteran of two startups, Tom is an expert on the technical, strategic, and business information management issues facing the enterprise today. Most of his work has been on emerging technologies and business challenges, and he brings a strong focus on the cross-functional work required to have early stage projects succeed. Tom has coauthored a book on big data and multiple thought-leadership papers. He earned a bachelor’s degree from Fordham University in New York and an MBA degree from the University of Maryland University College.