Big Data and Warehousing

Going with the Flow

Understanding the information supply chain

In this column, I will examine the world of information management from a slightly different perspective than pure information technologists do. I will focus on the flow of information and how we discover, modify, create, and manage these flows. People, processes, and machines all produce and consume information, and there is typically a natural flow from one or more producers to one or more consumers.

Information doesn’t just flow directly from producer to consumer, though. It is often processed along the way by intermediaries that consume the information, process it in some way, and then provide that information to subsequent consumers. This flow is akin to a supply chain for physical goods. We will explore the supply chain metaphor more deeply in future articles.

Figure 1. A simple information supply chain

Just as raw materials are processed into finished goods through a series of steps, we can look at information management systems as being made up of multiple, overlapping supply chains. Raw information enters the supply chain and is then cleansed, transformed, combined, and put through one or more analytical and operational processes designed to deliver insights to the information’s consumers. Supply chain patterns for analytical systems are becoming well known, with techniques such as Extract, Transform, Load (ETL) and data replication used to cleanse, combine, and load information into data warehouses and data marts.
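
To make the pattern concrete, here is a minimal ETL sketch in Python. The record layout and cleansing rules are illustrative assumptions of mine, not the behavior of any particular tool:

```python
# Minimal ETL sketch: extract raw records, cleanse and transform them,
# then load the results into a target store (here, just a list).

raw_records = [
    {"name": " alice ", "country": "us"},
    {"name": "BOB ", "country": "ca"},
]

def cleanse(record):
    # Transform step: trim whitespace and normalize case.
    return {
        "name": record["name"].strip().title(),
        "country": record["country"].strip().upper(),
    }

warehouse = []                          # stand-in for a warehouse table
for record in raw_records:              # extract
    warehouse.append(cleanse(record))   # transform + load

print(warehouse)
# [{'name': 'Alice', 'country': 'US'}, {'name': 'Bob', 'country': 'CA'}]
```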

Information supply chains pervade all organizations. Sometimes they are easy to identify, but often they are not. When we explicitly use tools such as InfoSphere Information Server, we can trace the flow of information through DataStage jobs using the InfoSphere Metadata Workbench. It is much more difficult, however, to find (and document) the flow of information through spreadsheets. Yet, for better or for worse, the movement of spreadsheets represents an information supply chain as well, even though it is often not controlled or managed.


A simple example: Managing reference data

Reference data is information that defines a consistent list of valid values for an attribute. Reference data can express a wide variety of things: lists of countries, regions, offices, bank branches, accounting codes, return authorization reasons, et cetera. These lists of valid values are used in lookup tables to populate drop-down menus in user interfaces, to control the legal set of values in a database column or message attribute, or to provide a classification of something else (for example, medical diagnostic codes that may be associated with a medical record).
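
As a deliberately tiny illustration, a reference data set can be modeled as a set of valid values that a database constraint or drop-down menu checks against. This Python sketch uses names and values of my own choosing:

```python
# A reference data set is, at its simplest, a list of valid values.
HONORIFICS = {"Mr.", "Ms.", "Dr.", "Prof.", "Rev."}

def is_valid_honorific(value):
    # The same check a database constraint or UI drop-down performs.
    return value in HONORIFICS

print(is_valid_honorific("Dr."))   # True
print(is_valid_honorific("Sir"))   # False: not (yet) in the set
```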


System 1    System 2
--------    --------
Mr.         01
Ms.         02
Dr.         03
Prof.       04
Rev.        05


While an organization may standardize on a common definition for a particular kind of reference data, it is often difficult to establish this common definition across all applications and databases. Packaged applications, for instance, may require specialized structures and values for reference data. Applications also often have unique representations designed to improve local processing: a numeric code might be stored instead of an alphanumeric description, for example. So in the example above, System 1 and System 2 have different representations for the same value.

This means that if we are moving information from System 1 to System 2, we need to translate each instance of “Mr.” from System 1 to “01” in System 2 for the two systems to understand each other. If we wanted to add a new honorific such as “Sir” to the list in System 1, we would also need to add the corresponding value in System 2—or System 2 won’t be able to understand what System 1 is talking about.
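
A minimal sketch of such a translation in Python shows both the mapping and the failure mode. The map contents mirror the table above; the function name is illustrative:

```python
# Illustrative map from System 1 honorifics to System 2 codes.
SYSTEM1_TO_SYSTEM2 = {
    "Mr.": "01",
    "Ms.": "02",
    "Dr.": "03",
    "Prof.": "04",
    "Rev.": "05",
}

def translate(value):
    # Translate a System 1 value to its System 2 code; fails loudly
    # if the map has not been kept in sync with System 1.
    return SYSTEM1_TO_SYSTEM2[value]

print(translate("Mr."))  # '01'

# If "Sir" is added to System 1 without updating the map, the
# supply chain breaks exactly as described above:
# translate("Sir")  ->  KeyError: 'Sir'
```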

While this may seem like a trivial example, these kinds of code tables are pervasive throughout most organizations. Often they are a bit more complex and can include language translations and other properties. Mappings are not always one-to-one, and there can be dozens of different systems that contain different representations of the same reference data. When reference data is well managed, the information supply chains that depend on these kinds of value translations work well. If a problem occurs, then the supply chain can break—or, even worse, provide inaccurate results.
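
A hypothetical Python sketch of these richer cases, with made-up codes and translations, might look like this:

```python
# Reference values often carry extra properties, such as translations.
COUNTRIES = {
    "US": {"en": "United States", "fr": "États-Unis"},
    "DE": {"en": "Germany", "fr": "Allemagne"},
}

# Mappings are not always one-to-one: several legacy codes may
# collapse into a single standardized value (a many-to-one map).
LEGACY_TO_STANDARD = {
    "USA": "US",
    "U.S.": "US",
    "GER": "DE",
}

print(LEGACY_TO_STANDARD["U.S."])   # 'US'
print(COUNTRIES["US"]["fr"])        # 'États-Unis'
```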

Managing reference data can be broken down into three key steps (a brief sketch illustrating them follows the list):

  1. Authoring and approval of each set of reference data, including versioning and indicators to specify when the reference data is valid
  2. Mapping of the sets, where required
  3. Distributing the reference data sets and maps to where they are needed
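
To make these steps tangible, here is a minimal Python sketch of what a versioned, approvable reference data set might look like. All class and field names here are hypothetical assumptions of mine, not RDM Hub’s data model:

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

# Hypothetical model of a managed reference data set. This is not
# RDM Hub's actual schema, just an illustration of the three steps.

@dataclass
class ReferenceDataSet:
    name: str
    version: int
    valid_from: date                 # when this version takes effect
    valid_to: Optional[date]         # None means open-ended
    approved: bool = False
    values: dict = field(default_factory=dict)

# Step 1: author and approve a versioned set with validity dates.
honorifics_v2 = ReferenceDataSet(
    name="honorifics", version=2,
    valid_from=date(2013, 1, 1), valid_to=None,
    values={"Mr.": "Mister", "Ms.": "Ms", "Sir": "Sir"},
)
honorifics_v2.approved = True

# Step 2: map the set to another system's representation.
honorifics_map = {"Mr.": "01", "Ms.": "02", "Sir": "06"}

# Step 3: distribute the set and its map to consuming systems
# (here, just a plain snapshot a downstream system could load).
snapshot = {
    "version": honorifics_v2.version,
    "values": honorifics_v2.values,
    "map": honorifics_map,
}
print(snapshot)
```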

Today, many organizations perform these steps through manual operations. However, due to the increasing complexity of managing these sets, as well as emerging regulations in industries such as banking and insurance, tools that specialize in reference data management are becoming increasingly important. InfoSphere Master Data Management Reference Data Management (RDM) Hub is one such tool that we have developed through working with a number of customers.

RDM Hub plays two distinct roles in managing information supply chains. First, it enables smooth communication between disparate systems that depend on a common understanding of reference data. Second, the management of reference data is itself implemented through supply chains: reference data is fed into, authored, and approved in the RDM Hub; reference data sets are mapped; and the values and maps are distributed for use by downstream systems. Managing the reference data supply chain can be critical to the success of other supply chains.


Summary

Existing information supply chains often need to evolve to address emerging business requirements. For example, analytics of all kinds are becoming increasingly important throughout the enterprise. Whether to better understand customer behavior, ascertain business risk, or improve patient diagnosis, we are looking for more ways to use more information to improve outcomes. Analytics tools require the right data to produce good results. We create new supply chains or extend existing ones to provide the right data at the right time so analytics tools can transform information into insight—and then we often extend supply chains further to distribute that insight to downstream users and applications.

The supply chain metaphor is useful for understanding today’s IT systems so that we can extend them to meet future needs. RDM offers an example of how successful management of supply chains can be key to the success of the broader business. The metaphor can also help us identify opportunities for optimization: we can measure how long data takes to move through a supply chain, how much data is moving, and how fresh the information is. We may even find supply chains that are no longer needed, such as applications that are being retired, or that have already been retired while their databases remain, unchanged.

In my next column, I will dive into the notion of an information supply chain with a deeper discussion of supply chain patterns. I will also introduce the work that we are documenting and describing for easy reuse across many information management problems.


Dan Wolfson

Dan Wolfson is an IBM Distinguished Engineer and the chief architect/CTO for the InfoSphere segment of the IBM Information Management Division of the IBM Software Group. He is responsible for architecture and technical leadership across the rapidly growing areas of InfoSphere including Tools, Information Integration, Master Data Management, Metadata Management, and Industry Models. Dan's previous roles include CTO for Business Integration Software and chief architect for Information Integration Solutions.

Dan has over 30 years of experience in research and commercial distributed computing, covering a broad range of topics including transaction and object-oriented systems, software fault tolerance, messaging, information integration, business integration, metadata management, and database systems. He has written numerous papers and is the co-author of “Enterprise Master Data Management: An SOA Approach to Managing Core Business Information.” Dan is a member of the IBM Academy of Technology Leadership Team and an IBM Master Inventor. In 2010, Dan was also recognized by the Association for Computing Machinery (ACM) as an ACM Distinguished Engineer.