Overcoming the Test Data Conundrum
Here’s a riddle for you: I’m thinking of something that every organization needs. Every organization can generate it, but generating it quickly and easily can often lead to elevated risk and elevated spending. What am I thinking of? Why, test data, of course.
We at IBM talk a lot these days about helping to build a smarter planet. To a significant degree, what makes your company look smart, in the eyes of customers and stakeholders, is your applications—applications that boost operating efficiency, engage buyers, uncover new market opportunities, and accelerate product development. If these applications are to be of high quality—and they’d better be—they must be thoroughly tested before being promoted to production.
Effective application testing, in turn, has a data dependency. Get the data part wrong, and you’re inviting problems. Does the test data contain too-small subsets of the values of key domains? The result could be data-handling errors that go undiscovered until encountered by application users. Are test database tables and files inconsistent from a referential integrity perspective? That could require corrective action, and the needed data repair work could delay completion of test runs, which can potentially cause application implementation target dates to slip. Invalid values in your test data could produce false positives with regard to application code errors, causing time to be wasted in diagnosing and maybe even “fixing” problems that would not occur in production because they were triggered by data values that would not exist in the real-world production environment.
What to do?
The simple approach can bite you
Recognizing the importance of test data completeness and quality, some organizations opt for what appears on the surface to be the simplest approach: they just copy their production data—all of it—into their program development and test environments. Going this route can reduce (though maybe not eliminate, as I’ll point out) problems associated with test data being too unrepresentative of the production environment, but it also introduces some very significant challenges.
It can seriously undermine your data privacy breach control efforts.
If there’s anything you don’t want, it’s to see your company’s name in the headline of a news story reporting a leak (or a flood) of highly sensitive data such as credit card information, government identity numbers, account IDs, or passwords from your system. This situation could severely damage your organization’s reputation among customers and potential customers—and if the sensitive data were to get into the hands of cyber-criminals, a lot of people could end up being hurt financially and otherwise.
Mindful of this threat, your IT security people have probably already taken multiple steps to really lock down data in your production systems. But when that data is copied into development and test environments, what’s the increase in the number of users who have read access to the database tables and files that contain sensitive data fields? Does that grow by a factor of five? Ten? One hundred? That’s a big problem, because data privacy breaches are often caused by the actions—sometimes accidental, sometimes not—of insiders.
It can cost a lot of money.
Last time I checked, no vendor is giving away disk storage capacity. Back in the day when saying “Our production database occupies a terabyte of disk space” got you some oohs and aahs, the thought of replicating that whole thing (maybe two or three times) in test and development environments didn’t cause you too much heartburn. Nowadays, data marts might have a terabyte or more of data, for crying out loud. Your production database mothership might have tens of terabytes of data and indexes (or maybe a petabyte or more), and individual tables might have billions of rows of data. Is consuming THAT MUCH disk space in test and development systems really cost-justifiable, just to ensure test data completeness and data value validity?
Oh, and disk space is just part of what you’re spending with this “dump and hope” approach to test data generation. How much production system CPU time is consumed in unloading a massive database, and how many cycles are consumed in shoving all that data into test tables and files? And how long do your programmers have to sit around twiddling their thumbs while one of these huge data-load jobs runs long? Are you charged with running a tight IT ship, cost-wise? This is not the way to do that.
One thing you can pretty much count on in an application development system is database design changes. This is not about people changing things for the sake of change. It’s about trying out design modifications that could deliver improvements in areas such as application performance and extensibility. One table is split into two tables, columns are added to tables, a column’s data type is changed—this is stuff that happens, and often happens for the better.
But when your means of populating test and development databases is a straight unload and load, target data structures have to match source structures. What does that mean? It means that you have to have a development and test database (and all its disk space) that just receives the data unloaded from production. Then you need another database that reflects the current design in development (likely to have some differences versus production). Getting data from the production copy database into the current-design development database becomes a task for programmers and development-supporting DBAs, and do you really want the cycles of your skilled IT professionals consumed in that kind of mapping work? How much does that slow you down when you’re trying to get new apps out the door?
So, forklifting the production database into your test and development environment increases the risk of data privacy breaches, increases storage and CPU capacity requirements, and impedes flexibility and agility. Other than that, it’s a good approach.
There’s a better way
The Age of Smart has arrived for test data generation, in the form of IBM’s InfoSphere Optim Test Data Management Solution offering (which I’ll call Optim TDM, for short). Optim TDM is a comprehensive solution that addresses test data generation needs across a variety of platforms, DBMSs, and file systems—including DB2 for z/OS, IMS, and VSAM on IBM System z servers. For all of the test data generation challenges mentioned previously, Optim TDM provides the tools necessary to get the job done securely, efficiently, and responsively and offers several capabilities.
Optim TDM provides a variety of mechanisms by which sensitive data in your production system can be protected in test and development environments. Combine Optim TDM with InfoSphere Optim Data Masking Solution, and you have even more sophisticated data protection options available to you.
“Shuffling” can be a good choice when it’s not so much field A and field B that have to be protected, but the combination of A and B that has to be secured. So, for example, it may be that an identification number for person X is of no value to a hacker if it is paired with the name of person Y. Perhaps data masking is the way to go, and if you go that route the Optim solution can ensure that generated values appear real. Masking can be accomplished in a random fashion, or in a way that protects sensitive data while preserving referential integrity relationships (a technique sometimes referred to as “repeatable masking”).
There are also lookup tables and routines can be used to securely provide valid values for names, mailing addresses, and e-mail addresses. Want to design a customized data transformation routine to address requirements that are unique to your environment? The Optim solution helps you there, too.
Test data right-sizing
As I pointed out earlier, dumping a huge production database into a test and development environment (or maybe into several such environments) can be a costly endeavor. On top of that, a lot of developers want smaller sets of test data—sometimes just one percent of what is in production. In addition, program testers may want a different percentage of production data volume versus program developers.
Delivering on these requests can pose a problem with respect to data completeness. How do you generate truly useful subsets of related data? In other words, if I get one percent of the rows from “parent” table Y, how do I ensure that I get the corresponding rows from referentially related tables? Optim TDM has you covered there, not only taking into account your database’s referential integrity constraints but also helping to ensure that you get the right subsets of production data into your test and development systems.
Maybe you just want to do some sampling (“get me every nth row from this table, and stop after you have 1 million rows”), or perhaps you want to focus on a particular subset of records (e.g., records for customers who live in the province of Ontario). You might want to get very precise in your selection of production data records to be brought into a test or development environment, working in “point and shoot” mode (“I want this one, that one, and that one”). Whatever approach you want to take to obtain a right-sized, referentially consistent (and appropriately masked) subset of production data for a test or development system, Optim TDM makes it easy.
Accommodating source and target database design differences
Optim TDM provides table- and column-mapping functions that provide the flexibility you need to handle data model differences between production and development systems. Just show what the mapping is, and Optim TDM will run it.
Now, with all these options and all this flexibility, you might think that managing Optim TDM would be difficult or cumbersome. Not so. The solution includes a Self Service Center for Test Data Management. This browser-based interface lightens the load on DBAs by enabling testers and developers to get more involved with the definition and execution of workflows and processes that pertain to test data generation (such as a process for approving test data refresh requests). DBAs can use the tool to drive data refresh actions. The Self Service Center both boosts productivity and imposes a measure of discipline on test and development data generation activities.
You might still be shaking your head, thinking, “All this about securely generating referentially complete and consistent subsets of production data for test and development purposes sounds good, but it presupposes that you know what all those referential relationships are—and frankly, we don’t.” First, that’s nothing to hang your head about, what with the use at many sites of application-based referential integrity, and the existence of data entity relationships that cross database boundaries (data in table A might be referentially related to data in file B). Second, we’ve got a solution for that problem, too. It’s called InfoSphere Discovery, and it can be an excellent complement to Optim TDM.
InfoSphere Discovery does what the name implies: it literally examines data stores—within or outside of a databases management system—and reports on apparent data relationships that it finds. Armed with that information, you can use Optim TDM to provide testers and developers with data that is referentially complete and consistent—without having to copy every last byte of production data into a test system. You right-size your test data while avoiding gaps that could have resulted from blind spots in your knowledge of data relationships in your production environment—blind spots that are eliminated via InfoSphere Discovery.
The innovation angle
Organizational success these days has a lot to do with new and fast: quickly identifying potential new customers, quickly getting new products and services to market, quickly determining what leads to new business from existing customers. IT is expected to enable these capabilities. Consider what the IBM InfoSphere Optim Test Data Management Solution can do to make test data generation part of your company’s innovation engine, versus being something that gums up the works.
Please share any thoughts or questions in the comments.