By Leon Katsnelson
In Part 1 of this article, we looked at how the IBM® Netezza® platform can process very large, complex data using simple SQL insert/select statements.
Many who are on the fence about moving logic from their ETL machine into Netezza get hung up on a simple question: can the approach scale? In short, the SQL insert/select statement carries some logistics we must harness before the approach rests on a solid foundation. Consider this:
Insert into EMPLOYEE (EMP_ID, EMP_FNAME, EMP_LNAME, EMP_START_DATE) select ID, FirstName, LastName, St_Date from OldEmployees;
Seems simple, or perhaps simplistic. How about this?
Insert into EMPLOYEE select ID, FirstName, LastName, St_Date from OldEmployees;
Note that in the preceding query, the insert column list is implied. Some of you will recognize the very significant danger here: if we later add a column into the middle of the pack, it will offset the columns that follow and misalign them with the target columns. Some say that at least one of the misaligned columns is bound to break the load… but will it? I’ve seen cases where misaligned queries ran for days or months before anyone caught them; the offset columns just happened to align with acceptable target data types even though the content was all wrong. We must bring this issue under control.
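The danger is easy to reproduce. The following sketch uses Python’s built-in sqlite3 module as a stand-in for the appliance (the table and column names follow the earlier examples); swapping two type-compatible columns in the select list loads every row without a single error:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE OldEmployees (ID INTEGER, FirstName TEXT, LastName TEXT, St_Date TEXT)")
cur.execute("CREATE TABLE EMPLOYEE (EMP_ID INTEGER, EMP_FNAME TEXT, EMP_LNAME TEXT, EMP_START_DATE TEXT)")
cur.execute("INSERT INTO OldEmployees VALUES (1, 'Ada', 'Lovelace', '1843-01-01')")

# No insert column list, so position alone decides where data lands.
# FirstName and LastName are swapped below; both are TEXT, so the
# database accepts every row -- silently wrong, exactly as described.
cur.execute("INSERT INTO EMPLOYEE SELECT ID, LastName, FirstName, St_Date FROM OldEmployees")
row = cur.execute("SELECT EMP_FNAME, EMP_LNAME FROM EMPLOYEE").fetchone()
print(row)  # the first-name column now holds the last name
```

No constraint fires and no error is raised; only inspection of the data reveals the problem.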
Now what if this were a fact table with 150 columns? The insert and select clauses immediately become unwieldy and unmanageable. What if we need to add or remove columns, or update the query to account for a data model change? These conditions functionally freeze the queries, making the solution brittle: little changes will shatter its operation.
Our best approach is to line up the target columns with the source rules (the elements of the SQL select phrase that will fill the target columns):1
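One possible layout of such a template, using the EMPLOYEE example from above (the numbered comments are only a visual aid pairing each target column with its source rule):

```sql
INSERT INTO EMPLOYEE
   ( EMP_ID             -- 1
   , EMP_FNAME          -- 2
   , EMP_LNAME          -- 3
   , EMP_START_DATE )   -- 4
SELECT
     ID                 -- 1 -> EMP_ID
   , FirstName          -- 2 -> EMP_FNAME
   , LastName           -- 3 -> EMP_LNAME
   , St_Date            -- 4 -> EMP_START_DATE
FROM OldEmployees;
```

Adding a column now means adding one target line and one source line in the same position; nothing else moves.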
We can extend a template like this to the limits of the database machine. The target columns will always line up, and the source/select columns will never offset. Does it sound like we’re signing up for a bit more than just sticking with the ETL tool? Don’t be fooled. The object is to simplify, not complicate, the logistics of generating a SQL statement. This structure lets us maintain the query without a painful impact review. It’s easy enough to scan with simple tools and compare its content with the catalog contents to find, apply, or just report changes. Keep in mind that any time a data model changes, practically all of the columns are pass-through.1 We can therefore concentrate on the exceptions, potentially automating over 90 percent of the data model’s reconciliation. Turnaround for a new data model then becomes agile, perhaps very agile.
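Generating such a statement straight from the catalog takes very little code. Here is a minimal sketch using Python’s built-in sqlite3 catalog (PRAGMA table_info) as a stand-in for the appliance’s catalog views; the rules dictionary is a hypothetical mapping of target columns to source rules, and any target column not listed in it is treated as pass-through by name:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE EMPLOYEE (EMP_ID INTEGER, EMP_FNAME TEXT, EMP_LNAME TEXT, EMP_START_DATE TEXT)")

# Hypothetical source rules: target column -> source expression.
rules = {"EMP_ID": "ID", "EMP_FNAME": "FirstName",
         "EMP_LNAME": "LastName", "EMP_START_DATE": "St_Date"}

def build_insert(con, target, source, rules):
    # Read the live column list from the catalog, so the generated SQL
    # can never drift out of step with the data model.
    cols = [r[1] for r in con.execute(f"PRAGMA table_info({target})")]
    select = ", ".join(rules.get(c, c) for c in cols)
    return (f"INSERT INTO {target} ({', '.join(cols)}) "
            f"SELECT {select} FROM {source}")

# We only generate text here; the source table need not exist locally.
sql = build_insert(con, "EMPLOYEE", "OldEmployees", rules)
print(sql)
```

When a column is added to the target, the generator picks it up from the catalog automatically; only columns that need a non-trivial source rule require a human edit.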
With this kind of power over query generation, we now have the freedom to manufacture lots of queries without losing logistical control. Can we scale the approach? Yes, and it’s already being done. A more advanced form of this template drives the framework engine,2 and it’s not uncommon to see dozens of such queries execute in a controlled order toward a functional goal, an approach that stays simple to maintain and reuse.
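Running dozens of queries “in a controlled order toward a functional goal” is, at bottom, dependency ordering. A minimal sketch with Python’s standard graphlib, where the step names and their dependencies are purely hypothetical:

```python
from graphlib import TopologicalSorter

# Hypothetical flow: each step maps to the steps it must run after.
steps = {
    "load_stage":   [],
    "load_dim":     ["load_stage"],
    "load_fact":    ["load_stage", "load_dim"],
    "refresh_aggs": ["load_fact"],
}

# static_order() yields a valid execution order; each entry would be
# one generated insert/select statement dispatched to the machine.
order = list(TopologicalSorter(steps).static_order())
print(order)  # steps with no dependencies come first
```

Steps with no mutual dependency can also be dispatched concurrently, which is exactly where the SIMD machine’s fast dispatch (discussed next) pays off.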
A primary logistical difference between MIMD and SIMD: with MIMD we’re forced into using fewer, more complex SQL statements, which are necessarily serialized because they saturate the hardware. SIMD encourages us to use more and simpler SQL statements and run them in parallel because the machine dispatches them so quickly. So SIMD finishes faster, by orders of magnitude, than its MIMD counterpart.
General-purpose MIMD platforms require initial (and perpetual) complexity, but purpose-built SIMD platforms allow us to begin simply and to stay simple forever. And simple means things are easier to maintain, extend, and troubleshoot—creating the foundation for agility throughout the life of the system.
1 Compleat Netezza, Copyright © 2012 VMII ISBN: 978-1-4610-9574-3. Excerpted by permission.
2 Netezza Data Integration Framework, Copyright © 2007-2012 Brightlight Consulting. All rights reserved.