By Tom Deutsch
By Nancy Kopp
By Paula Wiles Sigmon
By Joe Borges
By Stuart Litel
By Lester Knutsen
By James Kobielus
By Cristian Molaro
By Leon Katsnelson
By Susan Visser
By Bernie Spang
By the DB2 Guys
By Fred Ho
By Louis T. Cherian
By Shweta Shandilya
By Lawrence Weber
By Serge Rielau
By Dwaine Snow

This article is continued from Part 1.
The primary problem with most “general purpose” data management tools is that they must wrap themselves around a functional data model at arm’s length, using the functional model as-is without the appropriate anchors to provide a strong grip on the information. For example, some tables have keys that others do not. Some are unique and some are not.
However, this problem exacerbates rather than clarifies how we will manage all of the data in all of the tables in the same reliable, consistent manner. If we do it with keys on one table and by some other type of value on another—or if we plumb the database logs to discover what has been touched and when—these strategies will never provide a consistent picture of the entire database. More importantly, each nuance introduced by the variability of the tables’ contents only creates more failure points for their management. If only one can slip through the cracks, it surely will.
Objective management is not intrinsically based on the tables’ functional content. Instead, we add operational content to all the tables. This content is the foundation for table management and does not rely on the functional information to maintain their integrity of operation. For example, if your operations consist of a JOB_ID (typically an incremental number associated with each unit of work touching the tables) and this JOB_ID is added to each and every table, you now have the basis for a rollback. If a job touches 1 or 100 tables, it doesn’t matter. We can now track which tables (and which records) were touched. Many sites forego this simple implementation for one reason: their ETL tool doesn’t do it and they would have to manually and painstakingly formulate it into both the application tier and the data model.
A lot of you are already doing something with JOB_ID or an equivalent—but I suggest that you formalize it as part of an operational integrity model. Wrap some logic around it and grow it as an asset. Formulate a logging table that tracks which jobs touched which tables for traceability and auditability.
Some folks implement a JOB_ID and later decide to get rid of it—ostensibly to save space because nobody is using it. Bad plan! The problem is, most of the time a JOB_ID is just a latent artifact, like a buried quantity for troubleshooters. What if we could, at the end of each unit of work, record every table that we touched into a logging table? Then if we need to rollback, we have a one-stop-shop to find all tables and ultimately all records associated with the job. We might use this rarely for production runs, and that’s good. But functional testers absolutely love this kind of capability. They can reset their testing database to a known state and repeat their regression tests for objective closure of their test cases.
Now that we have a logging table telling us exactly what data just arrived, we can leverage it for all kinds of incremental activities. We don’t want to replicate raw data and reprocess it; we want to replicate it in the post-process. But what if we want to replicate that work to another database or server? We need only to reconcile our local log with the target database log, then incrementally transfer only the data that has changed.
What if we need to offload the most recent stuff to archive, to maintain a backup, or transfer it to a DR server in anticipation of hot failover? We now have an objective measure to tell us, at the solution level, what really happened and what we need to reconcile. An external, general-purpose utility has no such objective ability.
In Part 3 of this article, we’ll take a deeper look at the more daunting task of offloading and reloading the monstrous data quantities that PureData/Netezza is so famous for. Where will we store all those Libraries of Congress? And how will we get them back into the machine?
IBM Big Data, Integration and Governance 2013 Forums
Attend an event near you to learn how leading organizations are making sense of massive amounts and new types of information to create value
DB2 TechTalk: Deep Dive on BLU Acceleration in DB2 10.5, Super Analytics Super Easy
Thursday, May 30: 12:30 – 2:00 PM ET
Informix Chat with the Lab: Primary Storage Manager (PSM) a Parallel Backup Alternative to Ontape
Thursday, May 30: 11:30 – 1 PM ET
Big Data Executive Summit
June 7 (Dallas) and June 10 (San Francisco)
Big Data Seminar 2013, Featuring Krish Krishnan
June 14 in New York City
Hadoop Summit North America
June 26-27
Big Data and the Enterprise: A Perspective from Featured Gartner Analyst Donald Feinberg
July 11: 11AM ET
marcus evans Pharma Data Analytics Conference
July 10-11 in Philadelphia
IBM Smarter Content Summit 2013
Register now!
Big Data at the Speed of Business
Broadcast event replay now available
Information on Demand 2013: Early Bird Registration Now Open
November 3-7 in Las Vegas