Think about what happens when your brand-spankin’-new PureData/Netezza machine arrives: you can’t wait to break the plastic and get cracking. A few months down the road, after production rollout, you notice that the machine’s data is growing. That’s the way of big data. First it’s a few terabytes, then 10, then 20, and before you know it, you have some real data volume on your hands!
For all of us who have been in this situation, it behooves us to ask a few obvious questions before we reach that point. The issue is not merely that we might need more storage; we now need to manage the data in a consistent and predictable way. For warehouses of this size, we don’t want to repeat any processing work; we would rather replicate the results. And of course, if we intend to use another PureData/Netezza machine as the disaster recovery (DR) or backup machine, we need bidirectional replication as well.
We also need to pump the results to a DR server as they are processed, in case of hardware failure in high-availability scenarios. When the primary machine fails, we need to switch over to the secondary immediately. When we get the primary back online, we need the secondary to start pumping data back to re-sync the machines (including their sequence generators and logs). Can your replicator do that for tens of terabytes?
Let’s not forget that in data processing, bad data is a fact of life. But in this scenario, it will exist at a very large scale. Let’s say you crank up that ETL tool and load four dimension tables and a fact table. All of the data, except for one of the dimensions, arrives safely. Now what do you do? Do you fix the dimension and reload? If you cannot fix the dimension, do you back out the other tables? What if you just, in true ELT fashion, loaded 20 or 30 tables, or even 100 or more? How do you back out the data you just loaded? Or for that matter, what happens if someone comes to you on Wednesday and says, “The data you loaded on Monday is junk. Back it out.”
Easier said than done. You need a rollback mechanism. Can your ETL tool do that?
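One way to picture such a rollback mechanism is to stamp every loaded row with a batch identifier, so that a whole load can be removed later by that identifier. The sketch below is only an illustration under stated assumptions: it uses Python's built-in `sqlite3` as a stand-in for the warehouse, and the table names, the `load_batch_id` column, and the helper functions are all hypothetical rather than anything the appliance or an ETL tool provides.

```python
import sqlite3

# Hypothetical table names; a real warehouse would have its own schema.
TABLES = ("dim_customer", "dim_product", "fact_sales")


def create_schema(conn):
    """Create toy tables that carry a load_batch_id column (an assumption)."""
    for table in TABLES:
        conn.execute(
            f"CREATE TABLE {table} (payload TEXT, load_batch_id INTEGER)"
        )


def load_batch(conn, batch_id, rows_by_table):
    """Load rows into several tables, tagging each row with batch_id.

    The whole load is one transaction: if any insert fails, nothing lands.
    """
    with conn:  # commits on success, rolls back on an exception
        for table, rows in rows_by_table.items():
            conn.executemany(
                f"INSERT INTO {table} (payload, load_batch_id) VALUES (?, ?)",
                [(row, batch_id) for row in rows],
            )


def back_out_batch(conn, batch_id):
    """Delete every row that a given batch loaded; return rows removed per table."""
    removed = {}
    with conn:
        for table in TABLES:
            cur = conn.execute(
                f"DELETE FROM {table} WHERE load_batch_id = ?", (batch_id,)
            )
            removed[table] = cur.rowcount
    return removed


def count_rows(conn, table):
    """Return the number of rows currently in a table."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    load_batch(conn, 1, {"dim_customer": ["a"], "fact_sales": ["x", "y"]})  # Monday
    load_batch(conn, 2, {"dim_customer": ["b"], "fact_sales": ["z"]})       # Tuesday
    print(back_out_batch(conn, 1))  # Wednesday: back out Monday's load only
```

With this layout, "back out Monday's data" becomes a delete keyed on Monday's batch identifier, and loads from other days stay in place. Whether deletes at this scale are practical on a given machine is a separate question that the sketch does not answer.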
Similarly, you need backups. Wait: did you remember to use the compression features? 4x compression makes the queries run 4x faster (no kidding). So does 16x compression make them run 16x faster? Yes. The problem is that if we have a 40 TB database compressed at 4x and want to archive it in a general-purpose tool, we have to make room for 4x that volume uncompressed. That’s 160 TB of raw storage.
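The storage arithmetic in the paragraph above can be written out as a small helper. This is a sketch only; the function name is hypothetical, and it assumes the whole database compresses at a single uniform ratio.

```python
def uncompressed_size_tb(compressed_tb, compression_ratio):
    """Estimate the raw space needed to hold compressed data uncompressed.

    Assumes one uniform compression ratio across the whole database.
    """
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    return compressed_tb * compression_ratio


if __name__ == "__main__":
    # The article's example: 40 TB at 4x compression.
    print(uncompressed_size_tb(40, 4), "TB of raw storage")
    # The same 40 TB at 16x compression needs far more room.
    print(uncompressed_size_tb(40, 16), "TB of raw storage")
```

The better the compression on the machine, the larger the gap that an uncompressed archive has to absorb.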
We could use the onboard backup utility to offload the data in compressed form. Just one problem, though: While the backup utility may perform incremental backups at high speed, it won’t restore at high speed. Alas, a rapid, multi-threaded backup is of no use if the restoration is slow. After all, the clock is ticking when we need to restore.
Another issue for commodity backup tools is the monolithic backup. Trust me, offloading 20 TB takes a long time. Any process that runs this long requires checkpoints. Here, a checkpoint per table might suffice until we reach very large tables. If even one of our fact tables contains tens or hundreds of billions of rows, offloading it will take a very long time. We cannot afford to lose that time if the process fails partway through (runs out of disk space, network hiccup… pick your poison).
The better approach is to offload the data in consistently sized files, like incremental files, so that multiple files together represent the machine’s total contents. Why do this? When we need to restore data, we rarely need to restore the monolithic content of a massive file. In most cases, we’ve lost some increment of it. Likewise, our offline backup tools (tape, storage, and so on) simply choke on the data sizes common to a Netezza machine. Offloading in chunks relieves the pressure, effectively adapting those environments to the machine. Wouldn’t it be great to have a tool that lets us call up just the file(s) that contain the content we want to restore? Or store to tape in digestible increments?
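As a rough illustration of chunked, checkpointed offloading, the sketch below writes rows to fixed-size files and records each finished file in a manifest. A rerun after a failure skips the chunks already written, and a restore can look up only the files that cover the rows it needs. Everything here is an assumption made for illustration: the file naming, the JSON manifest, and both function names are hypothetical, and a real offload would stream from the machine rather than hold rows in a Python list.

```python
import json
import os


def offload_in_chunks(rows, out_dir, rows_per_file, manifest_name="manifest.json"):
    """Write rows to numbered chunk files, checkpointing after each file.

    The manifest lists every completed chunk, so a rerun resumes where the
    previous attempt stopped instead of starting over.
    """
    os.makedirs(out_dir, exist_ok=True)
    manifest_path = os.path.join(out_dir, manifest_name)

    done = []
    if os.path.exists(manifest_path):
        with open(manifest_path) as f:
            done = json.load(f)
    finished = {entry["file"] for entry in done}

    for start in range(0, len(rows), rows_per_file):
        name = f"chunk_{start // rows_per_file:05d}.txt"
        if name in finished:
            continue  # already offloaded by an earlier run
        chunk = rows[start:start + rows_per_file]
        with open(os.path.join(out_dir, name), "w") as f:
            f.write("\n".join(chunk) + "\n")
        done.append(
            {"file": name, "first_row": start, "last_row": start + len(chunk) - 1}
        )
        # Checkpoint: persist the manifest after every completed chunk.
        with open(manifest_path, "w") as f:
            json.dump(done, f)
    return done


def files_for_range(manifest, first_row, last_row):
    """Return only the chunk files that overlap the requested row range."""
    return [
        entry["file"]
        for entry in manifest
        if entry["last_row"] >= first_row and entry["first_row"] <= last_row
    ]
```

For example, with ten rows offloaded four per file, `files_for_range(manifest, 5, 6)` names a single chunk file, so a restore of those rows never has to touch the other two.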
In Part 2 of this article, we’ll look at the deeper mechanics of the above solutions, why they are necessary, and some points to consider when implementing them.