Large-Scale Data Management: Part 1
Think about what happens when your brand-spankin’-new PureData/Netezza machine arrives: you can’t wait to break the plastic and get cracking. A few months down the road, after production rollout, you notice that the machine’s data is growing. That’s the way of big data. First it’s a few terabytes, then 10, then 20, and before you know it, you have some real data volume on your hands.
For all of us who have been in this situation, it behooves us to ask ourselves a few obvious questions before we reach that point. The issue is not just that we might need more storage; it is that we now need to manage the data in a very consistent and predictable way. For warehouses of this size, we don’t want to repeat any processing work; we would rather replicate the results. And of course, if we intend to use another PureData/Netezza machine as the disaster recovery (DR) or backup machine, we need bi-directional replication as well.
We also need to pump the results to a DR server as they are processed, in case of hardware failure in high-availability scenarios. And when the primary machine fails, we need to switch over to the secondary immediately. When we get the primary back online, we need the secondary to start pumping data back in order to re-sync the machines (including their sequence generators and logs). Can your replicator do that for tens of terabytes?
Let’s not forget that in data processing, bad data is a fact of life. But in this scenario, it will exist at a very large scale. Let’s say you crank up that ETL tool and load four dimension tables and a fact table. All of the data, except for one of the dimensions, arrives safely. Now what do you do? Do you fix the dimension and reload? If you cannot fix the dimension, do you back out the other tables? What if you just, in true ELT fashion, loaded 20 or 30 tables, or even 100 or more? How do you back out the data you just loaded? Or for that matter, what happens if someone comes to you on Wednesday and says, “The data you loaded on Monday is junk. Back it out.”
Easier said than done. You need a rollback mechanism. Can your ETL tool do that?
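To make the idea concrete, here is a minimal sketch of one way a rollback mechanism could work: stamp every row with the batch that loaded it, load all tables for a batch in one transaction, and back out a batch by deleting on that stamp. It uses Python's built-in sqlite3 module as a stand-in database so it can run anywhere. The table names and the load_batch_id column are hypothetical, and nothing here is specific to Netezza or to any particular ETL tool.

```python
import sqlite3

def load_batch(conn, batch_id, rows_by_table):
    """Load every table for one batch in a single transaction, tagging each row."""
    with conn:  # commits if all inserts succeed, rolls back if any of them raises
        for table, rows in rows_by_table.items():
            conn.executemany(
                f"INSERT INTO {table} (id, payload, load_batch_id) VALUES (?, ?, ?)",
                [(row_id, payload, batch_id) for row_id, payload in rows],
            )

def back_out_batch(conn, batch_id, tables):
    """Delete everything a batch loaded, across all tables; return rows removed per table."""
    removed = {}
    with conn:
        for table in tables:
            cur = conn.execute(
                f"DELETE FROM {table} WHERE load_batch_id = ?", (batch_id,)
            )
            removed[table] = cur.rowcount
    return removed

def demo():
    conn = sqlite3.connect(":memory:")
    tables = ["dim_customer", "fact_sales"]  # hypothetical table names
    for table in tables:
        conn.execute(
            f"CREATE TABLE {table} (id INTEGER, payload TEXT, load_batch_id TEXT)"
        )
    load_batch(conn, "monday", {"dim_customer": [(1, "a")],
                                "fact_sales": [(10, "x"), (11, "y")]})
    load_batch(conn, "tuesday", {"dim_customer": [(2, "b")],
                                 "fact_sales": [(12, "z")]})
    # On Wednesday someone says Monday's data is junk: back out only that batch.
    removed = back_out_batch(conn, "monday", tables)
    remaining = {
        t: conn.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0] for t in tables
    }
    return removed, remaining
```

Calling demo() loads two batches and then backs out only Monday's, leaving Tuesday's rows in place. The single transaction in load_batch also covers the case where one table out of many fails to load: the tables that did arrive are rolled back with it. At real warehouse scale, a mass delete may not be the cheapest way to back out data, so treat this as an illustration of the bookkeeping rather than a performance recommendation.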
Similarly, you need backups. Wait, did you remember to use the compression features? 4x compression makes the queries run 4x faster (no kidding). So does 16x compression make them run 16x faster? Yes. The problem is that if we have a 40 TB database, compressed at 4x, and want to archive it with a general-purpose tool that stores the data uncompressed, we have to make room for 4x that volume. That’s 160 TB of raw storage.
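The storage arithmetic is simple enough to write down. This small sketch assumes the 40 TB figure is the compressed size on the machine and that the archive target stores the data fully uncompressed; the function name is just for illustration.

```python
def uncompressed_tb(compressed_tb, compression_ratio):
    """Raw space needed when data held at ratio:1 compression is expanded."""
    return compressed_tb * compression_ratio

def demo():
    # 40 TB on the machine, expanded at 4x and at 16x compression.
    return {ratio: uncompressed_tb(40, ratio) for ratio in (4, 16)}
```

Under those assumptions, demo() gives 160 TB at 4x and 640 TB at 16x, so the better the compression on the machine, the bigger the bill for an uncompressed archive.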
We could use the onboard backup utility to offload the data compressed. Just one problem, though: While the backup utility may perform incremental backups at high speed, it won’t restore at high speed. Alas, a rapid, multi-threaded backup is of no use if the restoration speed is slow. After all, the clock is ticking when we need to restore.
Another issue for commodity backup tools is the monolithic backup. Trust me, offloading 20 TB takes a long time, and any process that runs that long needs checkpoints. Per-table checkpoints might suffice until we enter large-table territory. If even one of our fact tables contains tens or hundreds of billions of rows, offloading it takes a long time, and we cannot afford to lose that time if the job fails partway through (out of disk space, network hiccup… pick your poison).
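Here is a rough sketch of what checkpointing below the table level could look like: the offload is split into chunks, each finished chunk is recorded in a small checkpoint file, and a rerun skips whatever is already recorded. The chunk names, the checkpoint.json file, and the write_chunk callback are all hypothetical; a real offload would call whatever export utility you actually use.

```python
import json
import os
import tempfile

def offload_with_checkpoints(chunks, out_dir, write_chunk):
    """Offload chunks in order; record each finished chunk so a rerun can resume."""
    ckpt_path = os.path.join(out_dir, "checkpoint.json")
    done = set()
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            done = set(json.load(f))
    written = []
    for chunk_id, rows in chunks:
        if chunk_id in done:
            continue  # finished in an earlier run
        write_chunk(os.path.join(out_dir, chunk_id + ".dat"), rows)
        done.add(chunk_id)
        # Write the checkpoint to a temp file, then swap it in atomically.
        tmp_path = ckpt_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(sorted(done), f)
        os.replace(tmp_path, ckpt_path)
        written.append(chunk_id)
    return written

def demo():
    chunks = [(f"fact_sales_{i:04d}", [f"row{i}"]) for i in range(4)]
    failing = {"fact_sales_0002"}  # simulate a failure partway through

    def flaky_write(path, rows):
        chunk_id = os.path.splitext(os.path.basename(path))[0]
        if chunk_id in failing:
            raise OSError("simulated: out of disk space")
        with open(path, "w") as f:
            f.write("\n".join(rows))

    with tempfile.TemporaryDirectory() as out_dir:
        try:
            offload_with_checkpoints(chunks, out_dir, flaky_write)
        except OSError:
            pass  # the job died on the third chunk
        with open(os.path.join(out_dir, "checkpoint.json")) as f:
            after_failure = json.load(f)
        failing.clear()  # problem fixed; rerun the same job
        resumed = offload_with_checkpoints(chunks, out_dir, flaky_write)
    return after_failure, resumed
```

In the demo, the first run dies on the third chunk, and the rerun picks up from there instead of starting over. The time lost to a failure is bounded by one chunk, not by one table.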
The better approach is to offload the data in consistently sized files, much like incremental files, so that multiple files together represent the machine’s total contents. Why do this? When we need to restore data, we rarely need to restore the monolithic content of a massive file. In most cases, we’ve lost some increment of it. Likewise, our offline backup tools (tape, storage, and so on) simply choke on the data sizes common to a Netezza machine. Offloading in chunks relieves the pressure, effectively adapting those environments to the machine. Wouldn’t it be great to have a tool that lets us call up just the file(s) that contain the content we want to restore? Or store to tape in digestible increments?
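One way to picture "call up just the file(s)" is a manifest that records what each offloaded file covers. The sketch below assumes each chunk file holds one table's rows for a contiguous range of load dates; the ChunkEntry fields, file names, and dates are invented for illustration and do not describe any existing tool.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class ChunkEntry:
    file_name: str
    table: str
    first_load_date: date
    last_load_date: date

def files_to_restore(manifest, table, start, end):
    """Return the chunk files for `table` whose date range overlaps [start, end]."""
    return [
        entry.file_name
        for entry in manifest
        if entry.table == table
        and entry.first_load_date <= end
        and entry.last_load_date >= start
    ]

def demo():
    manifest = [
        ChunkEntry("fact_sales_2024_01.dat", "fact_sales",
                   date(2024, 1, 1), date(2024, 1, 31)),
        ChunkEntry("fact_sales_2024_02.dat", "fact_sales",
                   date(2024, 2, 1), date(2024, 2, 29)),
        ChunkEntry("fact_sales_2024_03.dat", "fact_sales",
                   date(2024, 3, 1), date(2024, 3, 31)),
        ChunkEntry("dim_customer_2024_02.dat", "dim_customer",
                   date(2024, 2, 1), date(2024, 2, 29)),
    ]
    # We lost three days of fact data in February: which files do we need?
    return files_to_restore(manifest, "fact_sales",
                            date(2024, 2, 10), date(2024, 2, 12))
```

Here the lookup returns only the February fact file, so the restore touches one chunk instead of the whole table. The same manifest also tells a tape or storage system exactly which digestible pieces to fetch.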
In Part 2 of this article, we’ll look at the deeper mechanics of the above solutions, why they are necessary, and some points to consider when implementing them.