This article is continued from Part 2.
I recently sat in on a POC for a big-name data management tool (not IBM!), and I was flummoxed at the things that made the purveyors so excited. They could offload data and allow us to query the offline archives! This is useful for transactional/OLTP solutions, but worthless for data archived from a PureData/Netezza platform. I mean, can you even imagine performing a scanning/analytic query on three (or a dozen) terabytes of offline data? Not a chance. What we really need is a rapid way to get that data back into the machine so we can do a massively parallel query on it.
No can do, the big-name purveyor said—getting the data back into the machine is too painful. Wait… did he mean getting the data back into a transactional machine is painful? Yes—but that’s why we have the Netezza machine, where it’s not painful, and loading a million records per second is normal workload.
Add to this their glee at being able to back up and store data using parallel, multi-threaded extracts. But how do they get data back into the machine? Apparently with single-threaded inserts, which run the reload at a fraction of the offload's speed.
The catch: there’s no clock ticking when the data is offloading. When you need to restore, though, it’s like living in a fishbowl. The restore shouldn’t take longer than the backup. With the archive tracking table I mentioned in Part 1 of this article, we would have the capacity to offload just-changed data incrementally, then surgically reload exactly what we want, at high speed.
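To make the idea concrete, here is a minimal sketch of what such an archive tracking table might look like. The table name, column names, and chunk file names are all illustrative assumptions, not a product schema; sqlite3 simply stands in for wherever the tracking metadata would actually live.

```python
import sqlite3

# Hypothetical archive tracking table: one row per offloaded chunk file.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE archive_tracking (
        table_name   TEXT,
        chunk_file   TEXT,
        offloaded_at TEXT,    -- when the chunk left the machine
        row_count    INTEGER
    )""")

# Each incremental offload of just-changed data records its chunks here.
conn.executemany(
    "INSERT INTO archive_tracking VALUES (?, ?, ?, ?)",
    [("sales", "sales_2014_01.chunk", "2014-02-01", 50_000_000),
     ("sales", "sales_2014_02.chunk", "2014-03-01", 48_000_000)])

# A surgical reload asks the tracker which chunks to pull back,
# rather than restoring the whole table monolithically.
chunks = [r[0] for r in conn.execute(
    "SELECT chunk_file FROM archive_tracking "
    "WHERE table_name = ? AND offloaded_at >= ? "
    "ORDER BY offloaded_at",
    ("sales", "2014-02-01"))]
print(chunks)  # only the matching chunks become reload candidates
```

Each chunk in the resulting list can then be handed to its own loader thread, so the reload runs as parallel as the offload did.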
Many of you are aware of the zone-map capability of the Netezza machine. Tracking the high-low values in each disk extent (and now each page within an extent) tells the optimizer where the data is not, so it can focus its efforts on the remainder. Now imagine we do the same for the incrementally offloaded files of our largest tables: we track the high-low value of each column in every file chunk we offload. When an ad hoc user wants to pull something back into the machine, they merely designate a range on one or more of those columns. That lets them locate only the files they need to restore, then reload them in multi-threaded form. The user gets what they want without waiting for a monolithic reload (and we can no longer go monolithic at the data sizes commonly found in PureData/Netezza anyway).
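The pruning logic above can be sketched in a few lines. The chunk names, column name, and date values are hypothetical; the point is only the overlap test, which is the same trick a zone map uses to skip extents.

```python
# Zone-map-style metadata for offloaded chunks: for each chunk, the
# low/high value of each tracked column (illustrative values only).
chunk_zones = {
    "orders_000.chunk": {"order_date": ("2013-01-01", "2013-06-30")},
    "orders_001.chunk": {"order_date": ("2013-07-01", "2013-12-31")},
    "orders_002.chunk": {"order_date": ("2014-01-01", "2014-06-30")},
}

def chunks_for_range(zones, column, want_lo, want_hi):
    """Keep only chunks whose [low, high] for `column` overlaps the
    requested range; every other file is skipped entirely."""
    hits = []
    for chunk, cols in zones.items():
        lo, hi = cols[column]
        if hi >= want_lo and lo <= want_hi:  # ranges overlap
            hits.append(chunk)
    return hits

# An ad hoc user asks for Q4 2013: only one chunk needs to come back.
print(chunks_for_range(chunk_zones, "order_date",
                       "2013-10-01", "2013-12-31"))
```

Running this prints `['orders_001.chunk']`: two of the three files never touch the wire, which is exactly why the reload stays fast.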
More importantly, this approach allows us to store the data in native compressed form, so we can now transfer the data on and off the disk drives without any host-level entanglements. At one site leveraging 16x compression, for example, we could transfer data between machines at an effective speed of 6TB per hour with no special engineering. Imagine what we could do if we actually tuned it!
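The arithmetic behind that effective speed is worth spelling out. Assuming the 16x ratio from that site, an effective 6TB per hour only requires moving 6/16 of a terabyte of physical bytes per hour:

```python
# Back-of-the-envelope: data stored in native compressed form means the
# wire only carries the compressed bytes.
compression_ratio = 16        # 16x compression, per the site above
effective_tb_per_hour = 6     # what the user experiences
physical_tb_per_hour = effective_tb_per_hour / compression_ratio
print(physical_tb_per_hour)   # 0.375 TB/hour actually on the wire
```

In other words, a quite ordinary 375GB-per-hour transfer rate delivers the 6TB-per-hour effective result, with no special engineering at all.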
The primary marketplace problem with commodity tools is that they really have no intention of stepping up to these data sizes. One of these products attempted to copy data from one table to another—three billion rows—and the operation appeared to complete just before the product crashed. It seems that it had no place to store the row count of three billion because the product was built on integer variables that cannot store a number larger than two billion!
Not every tool can step into this arena—and not every vendor wants to. For those who are unafraid of PureData/Netezza’s capacity and what it can do, adventure awaits. For those who really need Netezza’s capacity but are embarking on the journey using commodity parts, remember this: In the old days, maps were marked with “There Be Dragons” in the far-flung regions where nobody had trodden. But lots of people have trodden the path you’re about to take. They will freely confess that the dragons are quite real, they don’t sleep, and they are not easily slain. So rent some dragon-slayers for the journey—and don’t get commodity dragon-slayers. They will scream like schoolchildren and run away. Dragons have that effect on people.
Have you had any of these experiences? Tell me about it in the comments!