Many data warehouse environments answer the database referential question the same way: it's not enforced. That is, the referential constraints are simply disabled in the database because they only create drag on the loading and maintenance processes. Even those data warehouses with the raw power to muscle through their capacity issues will disable the constraints when loading data into the machine.
Without rehashing the obvious, leaving the constraints enabled radically drags the load process as it devolves into a transactional mode, validating the keys one row at a time. (We know that anything processed one row at a time cannot scale for bulk operations.1) Once the environment has a handle on its data quality, we can disable the constraints for good: once the data scrubbing processes reliably catch bad data, they become the de facto gatekeepers of data quality. The constraints cannot fix the data; they can only prevent it from entering the environment.
Our other option, again for a traditional relational database management system (RDBMS), is to offload the referential keys from the target table into the extract, transform, load (ETL) tool's domain and cross-check the incoming data against the available keys. People dislike this cross-check because it seems like a lot of work. But the compromise alternatives aren't much better: they usually involve disabling the database constraints, loading the data into the target, re-enabling the constraints, and hoping for the best.
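The ETL-side cross-check can be sketched in a few lines: pull the target's key set once, then partition the incoming rows in bulk before loading. This is a minimal illustration; the field names, row shapes, and key values here are assumptions for the demo, not part of any specific ETL tool.

```python
def split_by_keys(incoming_rows, valid_keys, key_field):
    """Partition rows into loadable rows and referential rejects."""
    good, bad = [], []
    for row in incoming_rows:
        (good if row[key_field] in valid_keys else bad).append(row)
    return good, bad

valid_keys = {1, 2, 3}  # primary keys already present in the target
incoming = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 99},  # no matching primary key
]
good, bad = split_by_keys(incoming, valid_keys, "customer_id")
print(len(good), len(bad))  # 1 1
```

The rejects go to an error log for reconciliation while the clean rows load without interruption, which is the fully automated path the next paragraph argues for.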
We must then reconcile the errors arising from this option. But consider the cost of failure: we have to back out the data, scrub it again, and attempt the reload. This whole process takes time and may require operator intervention to complete. How much time are we willing to sacrifice in case of error, especially when the whole check can be fully automated?
The inside joke here is that the RDBMS will still perform a table scan, or at least an index scan, to find any missing keys. We are then involving the online transaction processing (OLTP) database in a bulk operation, which for an OLTP database is just a series of OLTP operations masked as a bulk operation. Performance is terrible, and it gets progressively worse as the data grows.
Not so in an IBM® PureData System for Analytics (formerly Netezza) machine. Because all operations are set based, we always get bulk power as an architectural priority. Suppose we compare one table to another by a given column (primary-to-foreign-key validation) and use a WHERE NOT EXISTS operation to find the keys that don't fit. How long would this operation take on a traditional platform versus PureData System for Analytics? Now answer that question in terms of billions of rows. Because of its massively parallel processing (MPP) configuration and its ability to scan more data in parallel, without any indexes at all, PureData System for Analytics is orders of magnitude faster than traditional platforms.
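The WHERE NOT EXISTS check itself looks the same on any SQL platform; only the execution strategy differs. Here is an in-memory sketch of the primary-to-foreign-key validation, using SQLite purely for illustration. The table and column names (customers, orders, customer_id) are assumptions for the demo.

```python
import sqlite3

# Build a tiny parent/child pair with one orphaned foreign key (99).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY);
    CREATE TABLE orders    (order_id INTEGER, customer_id INTEGER);
    INSERT INTO customers VALUES (1), (2), (3);
    INSERT INTO orders VALUES (10, 1), (11, 2), (12, 99);
""")

# Set-based check: find the foreign keys that don't fit.
orphans = conn.execute("""
    SELECT o.order_id, o.customer_id
    FROM orders o
    WHERE NOT EXISTS (
        SELECT 1 FROM customers c
        WHERE c.customer_id = o.customer_id
    )
""").fetchall()

print(orphans)  # [(12, 99)]
```

On an MPP platform the same statement is executed as parallel scans across all data slices, which is why it stays fast at billions of rows.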
Many years ago, I had a conversation with an Oracle guru who thought he’d latched onto a Netezza Achilles’ heel when he discovered that Netezza did not enforce referential constraints.1 “Aha!” was his virtual cry. “See? Netezza cannot be taken seriously!” But as we continued, he realized not only that Oracle’s constraints were useless, but also that his own environment didn’t even use them.
Constraints are critical for OLTP applications because we want incoming transactions to meet strong resistance when they try to insert poor-quality data. We want this resistance because we cannot control the application layer and must provide some degree of isolation so that a bad application programmer doesn't ruin our data quality. This priority does not apply to back-end loading, where we are in complete control of the information content and processes. We have high-powered ETL tools or data integration frameworks2 that allow us to scrub, cleanse, and shake out any data points we know won't pass muster, so we actively enforce the relationships and fix the problems in the application layer. Constraints are only in our way. Our data stands ready to load, and we want it inside the target table as quickly as possible. Constraints do not serve us now.
So the best practice is to build a staging table and load the data as is, and then cross-compare the incoming data with the existing data to keep bad records out of the target. With PureData System for Analytics, the data load is super fast. The bulk-data comparison is super fast. And because the primary and foreign key pairs are known metadata, we can actually build out some repeating, metadata-driven patterns that allow us to avoid hard coding the referential checks. It is simple to build and simple to test.
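The metadata-driven pattern above can be sketched simply: record the primary/foreign key pairs as metadata and render each referential check from a single template, so nothing is hard coded. The table and column names below are illustrative assumptions.

```python
# (staging table, foreign key column, parent table, primary key column)
KEY_PAIRS = [
    ("stage_orders",    "customer_id", "customers", "customer_id"),
    ("stage_shipments", "order_id",    "orders",    "order_id"),
]

CHECK_TEMPLATE = (
    "SELECT '{child}' AS src_table, s.{ccol} AS bad_key "
    "FROM {child} s "
    "WHERE NOT EXISTS (SELECT 1 FROM {parent} p WHERE p.{pcol} = s.{ccol})"
)

def build_checks(pairs):
    """Render one referential check per key pair from the metadata."""
    return [CHECK_TEMPLATE.format(child=c, ccol=cc, parent=p, pcol=pc)
            for c, cc, p, pc in pairs]

for sql in build_checks(KEY_PAIRS):
    print(sql)
```

Adding a new relationship means adding one row of metadata, not one more hand-written query, which is what makes the pattern simple to build and simple to test.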
Now what about the bad data? Whom do we report it to? How do we get it corrected and back into the target table? In short, we have effectively loaded the data to the target with no operational errors, but we still have a problem with reconciling and reporting bad data.
Fair warning to the uninitiated: PureData System for Analytics can use this process to find errors very quickly. In one location, we migrated the old data warehouse to the Netezza data warehouse and executed this referential check. The resulting error log immediately became the largest table in the entire database, recording over one billion data disconnections from the original data warehouse storage. One may ask: “How could a database with so many disconnections be considered viable?” Perhaps that’s the reason for moving to PureData System for Analytics? Just sayin’!
In Part 2 of this article, we’ll discuss scalable, automatic reconciliation.
1 Compleat Netezza, Copyright © 2012 VMII ISBN: 978-1-4610-9574-3. Excerpted by permission.
2 Netezza Data Integration Framework, Copyright © 2007-2012 Brightlight Consulting, All rights reserved.