In this column, I will examine the world of information management from a slightly different perspective than pure information technologists do. I will focus on the flow of information and how we discover, modify, create, and manage these flows. People, processes, and machines both produce and consume information, and there is typically a natural flow from one or more producers to one or more consumers.
Information doesn’t just flow directly from producer to consumer, though. It is often processed along the way by intermediaries that consume the information, process it in some way, and then provide that information to subsequent consumers. This flow is akin to a supply chain for physical goods. We will explore the supply chain metaphor more deeply in future articles.
Figure 1. A simple information supply chain
Just as raw materials are processed into finished goods through a series of processes, we can look at information management systems as being made up of multiple, overlapping supply chains. Raw information enters the supply chain and is then cleansed, transformed, combined, and put through one or more analytical and operational processes designed to deliver insights to the information’s consumers. Supply chain patterns for analytical systems are becoming well known, with a range of techniques such as extract, transform, load (ETL) and data replication used to cleanse, combine, and load information into data warehouses and data marts.
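To make the cleanse, transform, and combine stages concrete, here is a minimal sketch in Python. It is not how any particular ETL tool works; the stage functions, the `crm` and `billing` source names, and the sample records are all invented for illustration.

```python
def cleanse(records):
    """Drop records with an empty name and trim stray whitespace."""
    for r in records:
        name = r.get("name", "").strip()
        if name:
            yield {**r, "name": name}


def transform(records):
    """Normalize names to one capitalization so later stages can compare them."""
    for r in records:
        yield {**r, "name": r["name"].title()}


def combine(*sources):
    """Merge several sources, keeping the first record seen for each name."""
    seen = {}
    for source in sources:
        for r in source:
            seen.setdefault(r["name"], r)
    return list(seen.values())


# Hypothetical raw inputs from two producer systems.
crm = [{"name": "  ada lovelace "}, {"name": ""}]
billing = [{"name": "ADA LOVELACE"}, {"name": "alan turing"}]

# Each source flows through the same stages before being combined.
warehouse = combine(transform(cleanse(crm)), transform(cleanse(billing)))
print(warehouse)
```

Each stage consumes the output of the previous one, which is the same producer, intermediary, and consumer relationship described above.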
Information supply chains pervade all organizations. Sometimes they are easy to identify, but often they are not. When we explicitly use tools such as InfoSphere Information Server, we can look at the flow of information through DataStage jobs using the InfoSphere Metadata Workbench. It's much more difficult to find (and document) the flow of information through spreadsheets. Yet, for better or for worse, the movement of spreadsheets represents an information supply chain as well, even though it is often not controlled or managed.
Reference data is information that defines a consistent list of valid values for an attribute. Reference data can express a wide variety of things: lists of countries, regions, offices, bank branches, accounting codes, return authorization reasons, et cetera. These lists of valid values are used in lookup tables to populate drop-down menus in user interfaces, to control the legal set of values in a database column or message attribute, or to provide a classification of something else (for example, medical diagnostic codes that may be associated with a medical record).
|System 1|System 2|
|Mr.|01|
While an organization may standardize on a common definition for a particular kind of reference data, it is often difficult to establish this common definition across all applications and databases. Packaged applications, for instance, may require specialized structures and values for reference data. Applications also often have unique representations designed to improve local processing. For example, a numeric code might be used instead of an alphanumeric description. So in the example above, System 1 and System 2 have different representations for the same value.
This means that if we are moving information from System 1 to System 2, we need to translate each instance of “Mr.” from System 1 to “01” in System 2 for the two systems to understand each other. If we wanted to add a new honorific such as “Sir” to the list in System 1, we would also need to add the corresponding value in System 2—or System 2 won’t be able to understand what System 1 is talking about.
While this may seem like a trivial example, these kinds of code tables are pervasive throughout most organizations. Often they are a bit more complex and can include language translations and other properties. Mappings are not always one-to-one, and there can be dozens of different systems that contain different representations of the same reference data. When reference data is well managed, the information supply chains that depend on these kinds of value translations work well. If a problem occurs, then the supply chain can break—or, even worse, provide inaccurate results.
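The honorific example can be sketched as a simple lookup. This is a minimal illustration, not how RDM Hub or any other product implements value translation. Only the "Mr." to "01" pair comes from the example above; the function names and the `UnmappedValueError` class are hypothetical.

```python
# Code table from System 1 values to System 2 codes.
# Only the "Mr." -> "01" pair is taken from the article's example.
HONORIFIC_S1_TO_S2 = {"Mr.": "01"}


class UnmappedValueError(KeyError):
    """Raised when a source value has no counterpart in the target system."""


def translate(value, code_table):
    """Return the target-system code for value, failing loudly if it is unmapped."""
    try:
        return code_table[value]
    except KeyError:
        raise UnmappedValueError(value) from None


def unmapped_values(source_values, code_table):
    """List the distinct source values the target system would not understand."""
    return sorted(set(source_values) - set(code_table))


print(translate("Mr.", HONORIFIC_S1_TO_S2))
print(unmapped_values(["Mr.", "Sir"], HONORIFIC_S1_TO_S2))
```

Raising an error on an unmapped value such as "Sir" is a deliberate choice in this sketch: it breaks the chain visibly rather than letting an untranslated value flow downstream and produce inaccurate results.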
Managing reference data can be broken down into three key steps:
Today, many organizations perform these steps through manual operations. However, due to the increasing complexity of managing these sets, as well as emerging regulations in industries such as banking and insurance, tools that specialize in reference data management are becoming increasingly important. InfoSphere Master Data Management Reference Data Management (RDM) Hub is one such tool that we have developed through working with a number of customers.
RDM Hub plays two distinct roles in managing information supply chains. First, RDM Hub enables smooth communications between disparate systems that depend on a common understanding of reference data. Second, the management of reference data itself is implemented through supply chains in which reference data is fed into the RDM Hub, then authored and approved there. Reference data sets are mapped, and reference data values and maps are distributed for use by downstream systems. Managing the reference data supply chain can be critical to the success of other supply chains.
Existing information supply chains often need to evolve to address emerging business requirements. For example, analytics of all kinds are becoming increasingly important throughout the enterprise. Whether to better understand customer behavior, ascertain business risk, or improve patient diagnosis, we are looking for more ways to use more information to improve outcomes. Analytics tools require the right data to produce good results. We create new supply chains or extend existing ones to provide the right data at the right time so analytics tools can transform information into insight—and then we often extend supply chains further to distribute that insight to downstream users and applications.
The supply chain metaphor is useful for understanding today’s IT systems so that we can extend them to meet future needs. RDM offers an example of how successful management of supply chains can be key to the success of the broader business. The metaphor can also help us identify opportunities for optimization. We can try to understand the overall time it takes for data to move through the supply chain, how much data is moving, and how fresh the information is. Perhaps we can find supply chains that are no longer needed: for example, supply chains serving applications that are being retired, or that have already been retired even though their databases remain unchanged.
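The three measurements mentioned above (transit time, volume, and freshness) can be sketched as follows. This is an illustration under stated assumptions: the `Delivery` record, its fields, and the sample timestamps are hypothetical, and a real supply chain would gather these figures from its own job logs or metadata.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Delivery:
    """One batch of data arriving at the end of a supply chain (hypothetical)."""
    produced_at: datetime   # when the source system created the data
    delivered_at: datetime  # when the final consumer received it
    row_count: int          # how much data moved in this batch


def summarize(deliveries, now):
    """Return total volume, slowest transit time, and age of the newest data."""
    total_rows = sum(d.row_count for d in deliveries)
    slowest = max(d.delivered_at - d.produced_at for d in deliveries)
    freshness = now - max(d.produced_at for d in deliveries)
    return total_rows, slowest, freshness


# Made-up sample batches for illustration only.
batches = [
    Delivery(datetime(2013, 5, 1, 0, 0), datetime(2013, 5, 1, 2, 0), 100),
    Delivery(datetime(2013, 5, 1, 6, 0), datetime(2013, 5, 1, 7, 0), 50),
]
total_rows, slowest, freshness = summarize(batches, now=datetime(2013, 5, 1, 12, 0))
print(total_rows, slowest, freshness)
```

A freshness value that keeps growing while nothing new is produced is one possible signal that a supply chain is feeding a retired application.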
In my next column, I will dive into the notion of an information supply chain with a deeper discussion of supply chain patterns. I will also introduce the work that we are documenting and describing for easy reuse across many information management problems.