Large-Scale Referential Integrity for Analytics: Part 2
The first part of this article examines scalable capture of millions and even billions of referential errors. The power available in an IBM® PureData® System for Analytics machine to use set-based techniques for finding and capturing referential disconnections utterly eclipses and completely simplifies the equivalent operations required for any other technology. More importantly, the billions-of-rows scale of data stored and processed by a PureData System for Analytics machine requires a much more scalable solution.
This installment discusses scalable reconciliation. The primary problem we face now is the need for a scalable way to reconcile the referential errors, because with IBM PureData System for Analytics platform–level data streams, with error counts potentially in the millions or billions of records, eyeballs aren’t good enough anymore.
For example, consider the simple case of loading four million records to a fact table containing billions of rows. Of these four million records, twenty thousand contain a foreign key value that does not exist in the corresponding primary key (dimension) table containing three million rows. We now have twenty thousand erroneous records, all with the same error. The records aren’t themselves in error; they simply lack the foreign key connection to the primary table. If we could fix one and only one missing row in the dimensional table, the fact table would have no errors at all.
An administrator or operator cannot make sense of twenty thousand errors. If this condition continues daily—creating hundreds of thousands of errors before the problem is fixed—how can any administrator or operator reconcile this condition? We need a way to remediate the errors automatically, and in scale.
The records can fix themselves only if the given primary key arrives. How do we manage the bad records in the meantime? One shop set aside error tables that were physical replicas of the targets. These tables contained all error records to keep them out of the primary tables. This scenario is easily understood but not logistically scalable. That is, managing it is too difficult. If we needed to make changes to the database, we would have to make them in two places—the primary table and the error table—and perform all the necessary data repair as part of the update. We would also have to track data in two separate locations. This approach creates havoc when attempting to write consistent cross-checking queries and other maintenance operations.
The most effective way1 to solve this problem is simply to add an operational column to each record—we’ll call it current_ind and make it a byte in size. We will then require all processes accessing the table to use where current_ind = 0 to make sure they get only valid records. If this value is nonzero, it means the record is logically removed from processing. Keep in mind that we’ll also want a unique ID column—bigint data type—that will specifically locate the record in the table.
Now when we find the bad records, we’ll mark them in the current_ind field with a nonzero value denoting the error corresponding to Bad foreign key or the like. In the meantime, we’ll also record the bad row’s location and context in a tracking table2 called EXCEPTION_FACT. This table will track the table names of the primary and foreign key tables, and the keys that did not match—both by column and by value. It will include the original record identifier so that we can get back to the specific record in our foreign key table. As for the foreign key table, the bad records will physically exist alongside valid records; but with the current_ind containing a nonzero record, it logically removes the record from view. With these records in place, the good record and bad records both enter the target table. Bad records are logically invisible, and we don’t need a separate place for them.
From here, we could allow these records to arbitrarily accumulate and let someone else deal with the problem. But we have another option. Now, each time a primary key table is loaded, we’ll compare the incoming primary key data to the EXCEPTION_FACT. Do we have any records in the EXCEPTION_FACT—for this primary table and this column—that are awaiting this specific value as a foreign key? Perhaps we do. We might have over one hundred thousand of them. Joining those foreign key records to their corresponding EXCEPTION_FACT entry, zapping their current_ind to zero, and then deleting the entries in the EXCEPTION_FACT is simple enough.
In one fell swoop, we repaired the table just as if the primary key had always been there, and we avoided building out any secondary storage location for the error records. We did it automatically as the data arrived using a pattern-based repair protocol that scales, does not involve operator eyeballs, and does not impose upon the valid data in the table.
As I described this automated repair process to my Oracle-guru friend, his eyes popped wide when I suggested joining the EXCEPTION_FACT to the multibillion-row foreign key table with no indexes. “That’s insane,” was his first intelligible comment. On an Oracle machine, yes sir, that is insanity. Oracle cannot do that operation without frosting over.
On a PureData System for Analytics machine, however, this process is just one of many set-based power plays that are simply applied, easily administered, and can keep the data healthy, the processes scalable, and the reconciliations automatic. Now our operators can focus on something more pressing.
Conclusion: Which is better, power without constraints or constraints without power? The choice seems simple, and with PureData System for Analytics, simple works.
When was the last time you had to reconcile a great-big steaming bowl of referential errors using elbows and eyeballs? When attempting to reconcile or record them, could you use a simple, elegant, single-table solution like the EXCEPTION_FACT noted above, or did you see your exception-capture tables start to multiply like rabbits? Did you attempt to make it work in your ETL tool, and did the ETL continue to scale? Or, the 64K question: Did you just turn off the constraints and not worry about it? Shoot me some comments or suggestions on how you have made such a solution work well and continue to scale for ever-larger storage.
1 Compleat Netezza, Copyright © 2012 VMII ISBN: 978-1-4610-9574-3. Excerpted by permission.
2 Netezza Data Integration Framework, Copyright © 2007-2012 Brightlight Consulting, All rights reserved.