Big Data: Data Quality’s Best Friend? Part 1
Many people are under the mistaken impression that there’s an inherent trade-off between the volume of a data set and the quality of the data maintained within it. This concern comes up frequently—and it was a big topic recently on the panel that Tom participated in at the Financial Services Information Sharing and Analysis Center (FS-ISAC), among other places.
According to this way of thinking, you can’t scale out into the petabyte range without filling your Apache Hadoop cluster, massively parallel data warehouse, and other platforms with junk data that is inconsistent, inaccurate, redundant, out of date, or unconformed. But we disagree. Here’s why we think this notion oversimplifies what’s really going on.
Big data isn’t the transactional source of most data problems
Data quality problems in most organizations usually originate in the source transactional systems, whether that’s your customer relationship management (CRM) system, general ledger application, or something else. These systems are usually in the terabyte range.
During the discussion, Jim rightly noted that any IT administrator who fails to keep the system of record cleansed, current, and consistent has lost half the battle. Sure, you can fix the issue downstream (to some degree) by aggregating, matching, merging, and cleansing data in intermediary staging databases. But the quality problem has everything to do with inadequate controls at the data’s transactional source, and very little to do with the sheer volume of the data.
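To make the matching-and-merging step concrete, here is a minimal sketch of downstream record deduplication: standardize fields, match on a normalized key, and keep a single surviving record. The field names and matching rule are hypothetical illustrations, not the behavior of any particular product.

```python
def normalize(record):
    """Standardize fields so near-duplicate records compare equal (hypothetical rules)."""
    return {
        "name": " ".join(record["name"].title().split()),
        "email": record["email"].strip().lower(),
        "city": record["city"].strip().title(),
    }

def merge_records(records):
    """Match records on normalized email; keep the most recently updated survivor."""
    survivors = {}
    for rec in records:
        key = rec["email"].strip().lower()
        if key not in survivors or rec["updated"] > survivors[key]["updated"]:
            survivors[key] = rec
    return [{**normalize(r), "updated": r["updated"]} for r in survivors.values()]

records = [
    {"name": "ann  lee", "email": "ANN@X.COM ", "city": "boston", "updated": 2},
    {"name": "Ann Lee", "email": "ann@x.com", "city": "Boston", "updated": 1},
]
print(merge_records(records))
# → [{'name': 'Ann Lee', 'email': 'ann@x.com', 'city': 'Boston', 'updated': 2}]
```

The point of the sketch is that this cleanup happens after the fact: a product such as InfoSphere QualityStage applies far richer matching logic at scale, but none of it removes the need for controls at the transactional source.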
Downstream from the source of the problem, you can scale your data cleansing operations with a massively parallel deployment of IBM® InfoSphere® QualityStage®—or of IBM BigInsights™ tricked out for this function—but don’t blame the cure for an illness it didn’t cause.
Big data is aggregating new data sources that you haven’t historically needed to cleanse
In traditional warehouse systems, the issue of data quality is fairly well understood (if still a challenge) when you are primarily concerned with maintaining the core systems of record—customers, finances, human resources, the supply chain, and so on. But how about in the big data space?
A lot of big data initiatives are for deep analysis of aggregated data sources such as social marketing intelligence, real-time sensor data feeds, data pulled from external sources, browser clickstream sessions, IT system logs, and the like. These sources have historically not been linked to official reference data from transactional systems. Historically, you did not have to cleanse them because they were looked at in isolation by specialist teams that often worked through issues offline and weren’t feeding their results into an official system of record. However, cross-information-type analytics—which is common in the big data space—has changed this dynamic.
Although individual data points can be of marginal value in isolation, they can be quite useful when pieced into a larger puzzle. They help provide context for what happened, or what is happening.
Unlike business reference data, these new sources do not provide the sort of data that you would load directly into your enterprise data warehouse, archive offline, or need to keep for e-discovery purposes. Rather, you drill into this data to distill key patterns, trends, and root causes, and you would probably purge most of it once it has served its core tactical purpose. This generally takes a fair amount of mining, slicing, and dicing.
Data quality matters in two ways in this situation. First, you can’t lose the source, actors, participants, or actions—and these items need to be defined consistently with the rest of your data. Second, you can’t lose the lineage of how you boiled things down. The who, what, when, where, and how need to be discoverable and reproducible.
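One lightweight way to keep that lineage is to record, alongside each derived result, who produced it, what was done, how it was configured, and a fingerprint tying it to its exact inputs. The sketch below assumes a simple SHA-256 fingerprint over the input rows; the field names and format are illustrative, not any product’s lineage schema.

```python
import hashlib
import json

def lineage_record(source, actor, action, params, input_rows):
    """Capture the who/what/how of a boil-down so the result can be reproduced."""
    fingerprint = hashlib.sha256(
        json.dumps(input_rows, sort_keys=True).encode()
    ).hexdigest()
    return {
        "source": source,             # where the raw data came from
        "actor": actor,               # who ran the analysis
        "action": action,             # what transformation was applied
        "params": params,             # how it was configured
        "input_sha256": fingerprint,  # ties the result to its exact inputs
    }

rec = lineage_record(
    source="clickstream-q3",
    actor="analyst_jdoe",
    action="sessionize",
    params={"timeout_s": 1800},
    input_rows=[{"user": 1, "ts": 100}, {"user": 1, "ts": 200}],
)
print(rec["action"], rec["input_sha256"][:12])
```

Because the fingerprint is deterministic, anyone who later rediscovers the same inputs and parameters can verify they are reproducing the same boil-down—even after most of the raw data has been purged.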
As our IBM Research colleague John McPherson says, “Keep in mind that often when we’re talking about big data we are talking about using data that we haven’t been able to exploit well in the past—so we’re typically trying to solve different problems. We’re not trying to figure out the profitability of each of our stores. We should already be doing that using high-quality data from systems of record and doing the things we do to standardize and reshape as we put it into a data warehouse.” What we’re trying to do here, in John’s example, is find out what’s contributing to the profitability for the stores.
This article is continued in Part 2. In the meantime, please let us know about your experiences with big data quality in the comments.