Large-Scale Data Management in PureData/Netezza: Part 2
This article is continued from Part 1.
The primary problem with most “general purpose” data management tools is that they must wrap themselves around a functional data model at arm’s length, using the functional model as-is without the appropriate anchors to provide a strong grip on the information. For example, some tables have keys that others do not. Some are unique and some are not.
However, this problem exacerbates rather than clarifies how we will manage all of the data in all of the tables in the same reliable, consistent manner. If we do it with keys on one table and by some other type of value on another—or if we plumb the database logs to discover what has been touched and when—these strategies will never provide a consistent picture of the entire database. More importantly, each nuance introduced by the variability of the tables’ contents only creates more failure points for their management. If only one can slip through the cracks, it surely will.
Objective management is not intrinsically based on the tables’ functional content. Instead, we add operational content to all the tables. This content is the foundation for table management and does not rely on the functional information to maintain their integrity of operation. For example, if your operations consist of a JOB_ID (typically an incremental number associated with each unit of work touching the tables) and this JOB_ID is added to each and every table, you now have the basis for a rollback. If a job touches 1 or 100 tables, it doesn’t matter. We can now track which tables (and which records) were touched. Many sites forego this simple implementation for one reason: their ETL tool doesn’t do it and they would have to manually and painstakingly formulate it into both the application tier and the data model.
A lot of you are already doing something with JOB_ID or an equivalent—but I suggest that you formalize it as part of an operational integrity model. Wrap some logic around it and grow it as an asset. Formulate a logging table that tracks which jobs touched which tables for traceability and auditability.
Some folks implement a JOB_ID and later decide to get rid of it—ostensibly to save space because nobody is using it. Bad plan! The problem is, most of the time a JOB_ID is just a latent artifact, like a buried quantity for troubleshooters. What if we could, at the end of each unit of work, record every table that we touched into a logging table? Then if we need to rollback, we have a one-stop-shop to find all tables and ultimately all records associated with the job. We might use this rarely for production runs, and that’s good. But functional testers absolutely love this kind of capability. They can reset their testing database to a known state and repeat their regression tests for objective closure of their test cases.
Now that we have a logging table telling us exactly what data just arrived, we can leverage it for all kinds of incremental activities. We don’t want to replicate raw data and reprocess it; we want to replicate it in the post-process. But what if we want to replicate that work to another database or server? We need only to reconcile our local log with the target database log, then incrementally transfer only the data that has changed.
What if we need to offload the most recent stuff to archive, to maintain a backup, or transfer it to a DR server in anticipation of hot failover? We now have an objective measure to tell us, at the solution level, what really happened and what we need to reconcile. An external, general-purpose utility has no such objective ability.
In Part 3 of this article, we take a deeper look at the more daunting task of offloading and reloading the monstrous data quantities that PureData/Netezza is so famous for. Where will we store all those Libraries of Congress? And how will we get them back into the machine?