If you’re aggregating data sets into your Apache Hadoop cluster that have never coexisted in the same database before, and if you’re trying to build a common view across them, you may be in for a rude awakening. It’s not uncommon to find quality issues when you start working with information sources that have historically been underutilized.
When looking at underutilized data, quality issues can become a rat’s nest of nasty discoveries, so it pays to expect the unexpected. A couple of years ago, for example, we did a predictive analytics project on complex systems availability and found that the system data provided as a reference was highly variable and not as described in the spec. The “standard” was really more of a “suggestion.” In cases like this, you either need to go back and deal with the core system data generation or work past the quality issues. This is a fairly common occurrence since, by definition, when you are dealing with underutilized information sources, this may be the first time they have been put to rigorous use.
This issue rises to a new level of complexity when you’re combining structured data with a fresh tsunami of unstructured sources that, it almost goes without saying, are rarely managed as official systems of record. In fact, when dealing with unstructured information (which is the most important new source of big data), expect the data to be fuzzy, inconsistent, and noisy. A growing range of big data sources provide non-transactional data (event, geospatial, behavioral, clickstream, social, sensor, and so on) that is fuzzy and noisy by its very nature. Establishing an official standard and a shared method for processing this data through a single system is a very good idea.
When you talk about big data, you’re usually talking about more volume, more velocity, and more variety. Of course, that means you’re also likely to see more low-quality data records than in smaller data sets.
But that’s simply a matter of the greater scale of big data sets, rather than a higher incidence of quality problems. While it is true that a 1 percent data fidelity issue is numerically and administratively far worse at 1 billion samples as opposed to 1 million, the overall rate remains the same and its impact on the resulting analytics is consistent. Under such circumstances, dealing with the data cleanup may require more effort—but as we noted earlier, that’s exactly the sort of workload scaling where big data platforms excel.
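To make the arithmetic concrete, here is a minimal sketch of the 1 percent example. The function name and the use of a fixed error rate are our own illustrative assumptions, not part of any particular tool.

```python
def bad_record_count(total_records: int, error_rate: float) -> int:
    """Expected number of low-quality records at a given error rate."""
    return round(total_records * error_rate)


ERROR_RATE = 0.01  # the 1 percent fidelity issue from the text

for total in (1_000_000, 1_000_000_000):
    bad = bad_record_count(total, ERROR_RATE)
    # The absolute count grows with the data set; the ratio does not.
    print(f"{total:>13,} records: {bad:>10,} bad ({bad / total:.1%})")
```

The cleanup workload scales with the absolute count, which is why it is an effort problem rather than a sign that the data has become less trustworthy.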
Interestingly, big data is ideally suited to resolve one of the data quality issues that has long bedeviled the statistical analysts of the world: the traditional need to build models on training samples rather than on the entire population of data records. This idea is important but underappreciated. The scalability constraints of analytic data platforms have historically forced modelers to give up granularity in the data set in order to speed up model building, execution, and scoring. Not having the complete data population at your disposal means that you may completely overlook outlier records and, as a result, risk skewing your analysis toward only the records that survived the cut.
This isn’t so much a data quality problem (the data in the source and in the sample may be perfectly accurate and up to date) as a loss of data resolution downstream when you blithely filter out the sparse/outlier records. However, the effect can be the same. Put simply, the noise in the whole data set is less of a risk than distortion or compressed/artificial results from an incorrect or constrained sample. We’re not saying that sampling is a bad thing, but generally, when you have the option of removing the constraints that prevent you from using all the data, you should do it.
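A small simulation can illustrate how a sample loses sparse records. This is only a sketch under assumptions of our own choosing: the population size, the five planted outliers, the 1 percent sample, and the helper names are all hypothetical.

```python
import random


def build_population(n=100_000, n_outliers=5, seed=42):
    """Synthetic readings around 100, with a few rare extreme records planted."""
    rng = random.Random(seed)
    values = [rng.gauss(100.0, 10.0) for _ in range(n)]
    for idx in rng.sample(range(n), n_outliers):
        values[idx] = 10_000.0  # hypothetical outlier value
    return values


def count_outliers(values, threshold=1_000.0):
    """Count records above an (assumed) outlier threshold."""
    return sum(1 for v in values if v > threshold)


population = build_population()
# A 1 percent simple random sample, as a scalability-constrained modeler might take.
sample = random.Random(7).sample(population, k=1_000)

print("outliers in full population:", count_outliers(population))
print("outliers in 1% sample:      ", count_outliers(sample))
```

With five outliers in 100,000 records, a 1 percent sample is expected to contain about 0.05 of them, so most runs of a sketch like this will show none in the sample. The records are accurate in both cases; the sample simply cannot see them.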
We’re also not saying all of this is easy. Let’s look at a specific customer example in the messy social listening space. It’s easy to manage noisy or bad data when you are dealing with general discussion about a topic. The volume of activity here usually takes care of outliers, and you are—by definition—listening to customers. Data comes from many sources so you can probably trust (but verify through sensitivity analysis) that missing or bad data won’t cause a misinterpretation of what people mean. However, when you examine what a particular customer is saying and then decide how you should respond to that individual, missing or bad data becomes much more problematic. It may or may not be terminal in that analytics run, but it inherently presents more of a challenge. You need to know the impact of getting it wrong and design accordingly. Look for more on this topic in future columns.
Big data can be data quality’s best friend—or at least an innocent bystander on quality issues that originate elsewhere. Do you agree? Let us know in the comments.