Only a decade ago, 10 million records would have been considered a large volume of data. Today, the amount of data stored by enterprises is often in the petabyte or even exabyte range. This explosion is not limited to structured data—in fact, most of the added volume comes from unstructured sources, such as email, images, and documents. Companies use data warehouses to manage these large data volumes, providing users with fast access to important information to help them gain insight and drive innovation. But a data warehouse’s value depends on the completeness, accuracy, timeliness, and understanding of the data that is put into it.
Data with abundant errors, excessive duplication, too many missing values, or conflicting definitions leads to cost overruns, missed deadlines, and most important, users who do not trust the information they are provided. According to a report from Ventana Research, only 3 in 10 organizations view their data as always reliable. More than two-thirds (69 percent) of organizations spend more time preparing data for use than actually using it. When an organization doesn’t trust the data in its warehouse, different parts of the company may act independently and create their own projects to get the information they need—diminishing the value and return on investment (ROI) of the warehouse. According to a Forbes study cited by Bloor Research, “data-related problems cost the majority of companies more than $5 million annually. One-fifth estimate losses in excess of $20 million per year.”
An organization can have hundreds or even thousands of different systems. Information can come from numerous places—such as transactions, document repositories, and external information sources—and in many formats, including structured data, unstructured content, and streaming data.
An organization must be able to manage its supply chain of information, and then integrate and analyze it to make business decisions (see Figure 1). Unlike a traditional supply chain, an information supply chain has a many-to-many relationship. For example, data about the same person can come from multiple places because the person may be a customer, an employee, and a partner, and that information can end up in various reports and applications. Given this complexity, integrating information, ensuring its quality, and interpreting it correctly are crucial tasks that enable organizations to use the information for making effective business decisions. The underlying systems must be cost-effective and easy to maintain, and they must perform well for the workloads they need to handle, even as information continues to grow at exponential rates.

Figure 1: The information supply chain.
The success of a data warehouse hinges on robust data quality. Organizations realize the greatest value when they use data quality software with end-to-end capabilities that enable them to act on their data in the following ways.
Having a common business language is critical for aligning technology with business goals. In addition to a controlled vocabulary, the hierarchy and classification systems provide needed business context.
For most organizations, data discovery is a manual, error-prone process requiring months of human involvement to discover business objects, sensitive data, cross-source data relationships, and transformation logic. Organizations need an automated data discovery process that addresses single-source profiling, analysis of cross-source data overlap, discovery of matching keys, automated transformation, and prototyping and testing for data consolidation.
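To make two of these discovery steps concrete, here is a minimal Python sketch of single-source profiling and cross-source overlap analysis. It is an illustration only, not how any particular product works; the function names (`profile_column`, `key_overlap`) and the sample `crm` and `billing` records are hypothetical.

```python
from collections import Counter

def is_present(value):
    """Treat None and empty strings as missing values."""
    return value not in (None, "")

def profile_column(rows, column):
    """Single-source profile: row count, missing values, distinct values, top value."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if is_present(v)]
    counts = Counter(present)
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "distinct": len(counts),
        "most_common": counts.most_common(1)[0][0] if counts else None,
    }

def key_overlap(rows_a, col_a, rows_b, col_b):
    """Cross-source overlap: share of distinct values in A's column also found in B's."""
    a = {row[col_a] for row in rows_a if is_present(row.get(col_a))}
    b = {row[col_b] for row in rows_b if is_present(row.get(col_b))}
    return len(a & b) / len(a) if a else 0.0

# Hypothetical sample data from two source systems.
crm = [
    {"cust_id": "C1", "email": "ann@example.com"},
    {"cust_id": "C2", "email": ""},
    {"cust_id": "C3", "email": "bob@example.com"},
]
billing = [
    {"account": "C1", "balance": 10},
    {"account": "C3", "balance": 25},
    {"account": "C9", "balance": 0},
]

email_profile = profile_column(crm, "email")
overlap = key_overlap(crm, "cust_id", billing, "account")
print(email_profile)
print(f"cust_id/account overlap: {overlap:.0%}")
```

A high overlap between two columns suggests a candidate matching key, which an analyst would still need to confirm before using it to consolidate the sources.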
The risk of proliferating incorrect or inaccurate data can be reduced by using rules-driven analysis. Rules analysis is a key data assessment capability that extends the ability to compare, evaluate, analyze, and monitor expected data quality. It consists of rules that evaluate data through focused and targeted testing of that data against user-defined conditions.
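As a rough illustration of rules-driven analysis, the Python sketch below tests each record against a small set of user-defined conditions and tallies passes and failures per rule. The rule names, thresholds, and sample records are assumptions made for the example; a production rule set would be defined by the business.

```python
import re

# Hypothetical user-defined rules: each maps a rule name to a predicate on a record.
RULES = {
    "email_format": lambda r: bool(
        re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", r.get("email") or "")
    ),
    "age_in_range": lambda r: isinstance(r.get("age"), int) and 0 <= r["age"] <= 120,
    "country_present": lambda r: bool(r.get("country")),
}

def evaluate_rules(records, rules):
    """Run every rule against every record and count passes and failures."""
    results = {name: {"passed": 0, "failed": 0} for name in rules}
    for record in records:
        for name, check in rules.items():
            outcome = "passed" if check(record) else "failed"
            results[name][outcome] += 1
    return results

# Hypothetical sample records.
records = [
    {"email": "ann@example.com", "age": 34, "country": "US"},
    {"email": "not-an-email", "age": 210, "country": ""},
    {"email": None, "age": 51, "country": "DE"},
]

report = evaluate_rules(records, RULES)
for name, tally in report.items():
    print(name, tally)
```

Running the same rules on a schedule and comparing the tallies over time is one simple way to monitor whether data quality is holding steady or degrading.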
Organizations need to create and maintain an accurate view of master data entities, such as customers, vendors, locations, and products. A complete data cleansing solution includes data standardization, record matching, data enrichment, and record survivorship.
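The Python sketch below shows one simplified way that standardization, record matching, and survivorship can fit together. It assumes exact matching on a normalized email address and a "most recent non-empty value wins" survivorship policy; real cleansing tools typically use probabilistic matching and configurable policies, and data enrichment is left out here. All function names and sample records are hypothetical.

```python
def standardize(record):
    """Normalize casing and whitespace so equivalent values compare equal."""
    return {
        **record,
        "name": " ".join(record["name"].split()).title(),
        "email": record["email"].strip().lower(),
    }

def match_key(record):
    """Assumed matching rule: records with the same email are the same entity."""
    return record["email"]

def survive(group):
    """Survivorship: per field, keep the most recently updated non-empty value."""
    ordered = sorted(group, key=lambda r: r["updated"], reverse=True)
    return {
        field: next((r[field] for r in ordered if r[field] not in (None, "")), None)
        for field in ordered[0]
    }

def cleanse(records):
    """Standardize, group matching records, and build one surviving record per group."""
    groups = {}
    for record in map(standardize, records):
        groups.setdefault(match_key(record), []).append(record)
    return [survive(group) for group in groups.values()]

# Hypothetical customer records; dates are ISO strings so they sort chronologically.
raw = [
    {"name": "ann  lee", "email": " Ann@Example.com", "phone": "", "updated": "2023-01-05"},
    {"name": "Ann Lee", "email": "ann@example.com", "phone": "555-0100", "updated": "2022-11-20"},
    {"name": "BOB RAY", "email": "bob@example.com", "phone": "555-0199", "updated": "2023-02-01"},
]

golden = cleanse(raw)
for record in golden:
    print(record)
```

In this example the two Ann Lee records collapse into one, and the surviving record takes its phone number from the older entry because the newer one left that field empty.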
A centralized and holistic view across the entire landscape of data quality processes, with visibility into data transformations that operate inside and outside of data quality and data integration systems, arms organizations with critical information that can lead to sound decisions.
For IBM customers, the IBM® InfoSphere® Information Server data quality suite is a fully integrated software platform that provides all of the capabilities outlined above. It facilitates the collaboration needed to develop and support a data warehouse, helping organizations to maximize their technology ROI.