Many people are under the mistaken impression that there’s an inherent trade-off between the volume of a data set and the quality of the data maintained within it. This concern comes up frequently—and it was a big topic recently on the panel that Tom participated in at the Financial Services Information Sharing and Analysis Center (FS-ISAC), among other places.
According to this way of thinking, you can’t scale out into the petabyte range without filling your Apache Hadoop cluster, massively parallel data warehouse, and other nodes with junk data that is inconsistent, inaccurate, redundant, out of date, or unconformed. But we disagree. Here’s why we think this notion is an oversimplification of what’s really going on.
In most organizations, data quality problems originate in the source transactional systems—whether that’s your customer relationship management (CRM) system, general ledger application, or something else. These systems are usually in the terabyte range.
During the discussion, Jim rightly noted that any IT administrator who fails to keep the system of record cleansed, current, and consistent has lost half the battle. Sure, you can fix the issue downstream (to some degree) by aggregating, matching, merging, and cleansing data in intermediary staging databases. But the quality problem has everything to do with inadequate controls at the data’s transactional source, and very little to do with its sheer volume.
Downstream from the source of the problem, you can scale your data cleansing operations with a massively parallel deployment of IBM® InfoSphere® QualityStage®—or of IBM BigInsights™ tricked out for this function—but don’t blame the cure for an illness it didn’t cause.
In traditional warehouse systems the issue of data quality is fairly well understood (if still a challenge) when you are primarily concerned with maintaining the core systems of record—customers, finances, human resources, the supply chain, and so on. But how about in the big data space?
A lot of big data initiatives are for deep analysis of aggregated data sources such as social marketing intelligence, real-time sensor data feeds, data pulled from external resources, browser clickstream sessions, IT system logs, and the like. These sources have historically not been linked to official reference data from transactional systems. Historically, you did not have to cleanse them because they were looked at in isolation by specialist teams that often worked through issues offline and weren’t feeding their results into an official system of record. However, cross-information type analytics—common in the big data space—have changed this dynamic.
Although individual data points can be of marginal value in isolation, they can be quite useful when pieced into a larger puzzle. They help provide context for what happened, or what is happening.
Unlike business reference data, these new sources do not provide the sort of data that you would load directly into your enterprise data warehouse, archive offline, or need to keep for e-discovery purposes. Rather, you drill into it to distill key patterns, trends, and root causes, and you would probably purge most of it once it has served its core tactical purpose. This generally takes a fair amount of mining, slicing, and dicing.
Data quality matters in two ways in this situation. First, you can’t lose the source, actors, participants, or actions—and these items need to be defined consistently with the rest of your data. Second, you can’t lose the lineage of how you boiled things down. The who, what, when, where, and how need to be discoverable and reproducible.
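To make that second requirement concrete, here is a minimal sketch of a lineage record that captures the who, what, when, where, and how of a derived result. The field names and values are illustrative assumptions for this article, not any particular product's schema:

```python
from dataclasses import dataclass, field

# Hypothetical sketch: field names are illustrative, not an IBM or standard schema.
@dataclass
class LineageRecord:
    source: str                 # where the raw data came from ("where")
    actor: str                  # who ran the transformation ("who")
    action: str                 # what was done and how ("what"/"how")
    when: str                   # ISO 8601 timestamp ("when")
    inputs: list = field(default_factory=list)  # upstream artifacts, for reproducibility

    def describe(self) -> str:
        # Human-readable audit line for discovery
        return f"{self.actor} ran '{self.action}' on {self.source} at {self.when}"

rec = LineageRecord(
    source="clickstream-2013-05",
    actor="analytics-team",
    action="sessionize + aggregate",
    when="2013-05-30T12:00:00Z",
    inputs=["raw_clicks.log"],
)
print(rec.describe())
```

Attaching a record like this to each boiled-down data set is what keeps the distillation discoverable and reproducible after the raw source data has been purged.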
As our IBM Research colleague John McPherson says, “Keep in mind that often when we’re talking about big data we are talking about using data that we haven’t been able to exploit well in the past—so we’re typically trying to solve different problems. We’re not trying to figure out the profitability of each of our stores. We should already be doing that using high-quality data from systems of record and doing the things we do to standardize and reshape as we put it into a data warehouse.” What we’re trying to do here, in John’s example, is find out what’s contributing to the profitability for the stores.
This article is continued in Part 2. In the meantime, please let us know about your experiences with big data quality in the comments.