By Tom Deutsch
By Nancy Kopp
By Paula Wiles Sigmon
By Joe Borges
By Stuart Litel
By Lester Knutsen
By James Kobielus
By Cristian Molaro
By Leon Katsnelson
By Susan Visser
By Bernie Spang
By the DB2 Guys
By Fred Ho
By Louis T. Cherian
By Shweta Shandilya
By Lawrence Weber
By Serge Rielau
By Dwaine Snow

Organizations need to work with production data for test purposes. The amount of data or type of data needed to satisfy test requirements will vary depending on the focus of the test cases. Depending on the requirements of the testing initiatives (that is, performance, boundary conditions, volume, and so on), different subsets of data from the production system will be required to meet the overall test strategy and objectives. The IBM® InfoSphere® Optim™ Test Data Management solution allows you to implement a production data subsetting strategy that will meet your testing objectives.
Several subsetting strategies exist for extracting data from your production environment. The process of extracting data from the production environment results in an Optim Extract file. However, the creation of this file is only half of your test data provisioning strategy. Strategies for populating the test environment and options available will be covered in upcoming articles.
Optim Test Data Management software provides several options when extracting the data from the production environment. To use those options effectively, the user designing the extract service should have some knowledge of the data models from which data is to be extracted, as well as the requirements within the test environment.
For organizations with limited production windows during which queries can be performed, the option to clone and privatize the production data is appealing. Optim can be used to extract all production data at once—privatizing it before landing it as the gold master extract file.
The Optim access definition is a user-created data model for extracting data. In addition to capturing which tables are to be extracted from the data store, it also contains the physical and logical relationships used in the data traversal process. An access definition can contain any number of tables with or without relationships.
Once the access definition is in place, the designer of the access definition determines which (if any) of the data extraction methods are necessary.
If a targeted set of data meeting specific criteria is desired, Optim provides the ability to specify SQL WHERE clauses on one or more tables in the data model.
In the example below, the data model appears as follows:

The start table (or starting point for traversal) begins with the CUSTOMERS table. This example operation requires the extraction of all customers in the state of New Jersey who are represented by sales representatives responsible for the city of Trenton.
To add these criteria to the access definition, the user would enter the table list and apply the criteria for each table.

After all criteria have been applied, they appear in the Access Definition Editor.


If testing requirements are limited to only a few specific rows, Optim includes the ability to provide a list of primary keys. All rows in the data store matching the key values are extracted. The primary key list is shown as a “Point and Shoot” list within Optim. You can use this list in conjunction with the selection criteria or as the only means for subsetting data.
This example shows how a Point and Shoot list can be used with the selection criteria supplied above. The resulting Optim Extract file will contain all customers in the state of New Jersey with sales representatives for the city of Trenton, plus five additional customers with the customer IDs listed below.
After choosing the Point and Shoot option, the user can select the desired rows using the Point and Shoot Editor.

In this example, selecting five rows results in a Point and Shoot file containing the comma-separated values for each row’s primary key.

This file is then supplied on the extract operation where any row containing the matching primary keys is extracted from the production database and stored in the Optim Extract file.
Optim offers various ways to select a sampling of rows. One method is to specify a sampling factor for the desired tables. This sampling factor is referred to as “Every Nth” in the access definition and dictates the interval in which rows within a table will be extracted. For example, if you enter a value of 10, the process will extract every tenth row in the table starting with the tenth row.
A second method is known as a row limit. This numeric value represents the maximum number of rows that will be extracted for a given table. Row limit and Every Nth can be used together.

Another powerful sampling mechanism is called grouping. With this method, you can extract a number of rows based on values in a particular column in the start table within your access definition. You simply select the column in the start table that contains the values you want to use for grouping. For example, you can select the STATE column in the CUSTOMERS table as the grouping column. Next, you would specify the number of unique groups that should appear as part of the extract result. Within a group, rows have the same value in the column selected for grouping. The final step is to specify the number of rows you want in each group.

Examples of data grouping using the CUSTOMERS table’s STATE column as the grouping column are provided below.
|
Number of Unique Groups |
Number of Rows per Group |
|
| To extract all rows of customer data from any 10 states |
10 |
<empty> |
| To extract 50 rows of customer data from all states |
<empty> |
50 |
| To extract 50 rows of customer data from any 10 states |
10 |
50 |
Note: Selection criteria can be combined with the grouping attributes to further refine the sampling. In the preceding example, selection criteria could have been used to refine the sample to 10 specific states.
IBM Big Data, Integration and Governance 2013 Forums
Attend an event near you to learn how leading organizations are making sense of massive amounts and new types of information to create value
DB2 TechTalk: Deep Dive on BLU Acceleration in DB2 10.5, Super Analytics Super Easy
Thursday, May 30: 12:30 – 2:00 PM ET
Informix Chat with the Lab: Primary Storage Manager (PSM) a Parallel Backup Alternative to Ontape
Thursday, May 30: 11:30 – 1 PM ET
Big Data Executive Summit
June 7 (Dallas) and June 10 (San Francisco)
Big Data Seminar 2013, Featuring Krish Krishnan
June 14 in New York City
Hadoop Summit North America
June 26-27
Big Data and the Enterprise: A Perspective from Featured Gartner Analyst Donald Feinberg
July 11: 11AM ET
marcus evans Pharma Data Analytics Conference
July 10-11 in Philadelphia
IBM Smarter Content Summit 2013
Register now!
Big Data at the Speed of Business
Broadcast event replay now available
Information on Demand 2013: Early Bird Registration Now Open
November 3-7 in Las Vegas