Is Your Big Data Hot, Warm, or Cold?
A multi-temperature data management solution refers in part to having data that is frequently accessed on fast storage (hot data), compared to less-frequently accessed data stored on slightly slower storage (warm data) and rarely accessed data stored on the slowest storage an organization has (cold data). Developing this solution requires a set of key performance indicators (KPIs) to take your data’s “temperature” and to help make both operational and business decisions involving your data. To demonstrate this, assume that your data is segmented and managed as follows:
- Data less than or equal to 31 days
- Data between 31 days and 90 days
- Data between 91 days and 180 days
- Data between 181 days and 365 days
- Data greater than 365 days
While it may be reasonable to assume that two-year-old data is less frequently accessed than data that is 90 days old, understanding the frequency of access and change in greater detail can trigger other business decisions. For example, if a month that has had no changes for the past six months had 100 rows changed in an extract, transform, and load (ETL) cycle, you may not want to take any action. Alternatively, if for that same month 10,000 rows changed in an ETL cycle, you might consider one or more of the following actions:
- Execute some form of sampling to determine if further analysis is required
- Re-execute certain reports
- Investigate the ETL process to understand why there was such a significant change
- Maintain summary tables and materialized query tables (MQTs) that are affected
- Use IBM DB2 High Performance Unload on the affected data or the entire table
- Perform table space backups
- Reorganize the data (or indexes only)
- Execute a runstats
- Perform some form of storage management or archiving
Identifying what data changed, along with the volume and frequency of change, can provide valuable input for operational and business decisions. This article shares some of the available metrics that can help you understand the frequency, volume, percentage of change, and type of actions that constitute changes to your data.
Developing key performance indicators
Figure 1 shows a bar graph representation of a table where the months December, November, October, September, August, and July have a higher rate of change and are more heavily accessed than data for the months June, May, April, March, February, and January.Figure 1: Frequency of access versus frequency of change
Understanding frequency of access, rate of change, and other useful metrics
When the database is activated, the “always-on” metrics in DB2 allow quick and easy reporting of metrics, which can then be used to develop a business view of data access patterns and changed data activity. These metrics are available from the time the database was last activated and can be stored in (user-defined) tables for further analysis.
Figure 2 lists the key metrics that are available using the table function MON_GET_TABLE for each table and each range partition of a table:
- Number of times a table or range partition was accessed
- Number of rows read (table or range partition)
- Number of rows inserted (table or range partition)
- Number of rows update (table or range partition)
- Number of rows deleted (table or range partition)
- Number of row updates that resulted in no changes to any of the column values of a row (table or range partition)
- Table space in which table or range partition resides
These metrics enable you to answer the following questions:
- How many rows changed and what was the rate of change for a given time period (when issuing and storing the results of the invoking table function)?
- How many “newly arrived” rows were processed in a given week?
- How many update statements were executed that did not result in an update actually occurring (item 6 from Figure 2)?
- What were the total number of rows that were updated in a table space?
- How many rows were deleted in a particular time period?
While index metrics do not provide input into data temperature, they still help complete the picture of your data by interpreting index utilization and index performance using the table function MON_GET_INDEX. A subset of those metrics is listed in Figure 3:
- Number of index only scans
- Number of index access scans
- Number of key column updates
- Number of include column updates
- Number of index jump scans
- Number of page splits
Identifying dormant data
Data that is not accessed for an extended period of time is often referred to as “dormant data”—for example:
- Data that is kept only to satisfy legal requirements
- Data that is no longer updated but is still being queried
The system catalog column named LASTUSED is asynchronously updated to store the last date that an object was used or referenced. This information is available for indexes, packages, tables, each range partition of a table, and MQTs. By capturing this information over time, you can identify how long any particular database object has not been accessed.
There are other scenarios for which the LASTUSED column also provides value, such as the dropping of indexes that are not being accessed after a database migration. The following system catalog views contain the LASTUSED column:
Providing a business view of daily SQL activity
In the article “Going Global with Data Mart Consolidation—Part One: Using Row and Column Security to Give Local Views and Correct Currency Aggregation,” database roles were defined for each country referred to in the article as shown in Table 1:
|United Arab Emirates||UNITED_ARAB_EMIRATES_ROLE|
If you defined a workload for each role (country), you could provide many of the same metrics reported at the table level using the table function MON_GET_TABLE—but instead of being reported from an operational point of view, they would be reported from a business point of view. Figure 4 lists some of the key metrics that become immediately available using the table function MON_GET_WORKLOAD once a workload definition is created:
- Total CPU time
- Total CPU utilization
- Total number of SQL statements executed
- Rows read
- Rows modified
- Rows returned
- Percentage of I/O wait time
- Buffer pool hit ratio expressed as a percentage
- Average number of rows read from the table per rows returned to the application
These metrics enable you to answer some of the following business considerations:
- How many rows did users from Brazil read? How many rows were returned?
- What percentage of I/O wait time was experienced by users from China after tables were moved from one storage group to another?
- What was the buffer pool hit ratio for users in India?
- What was the total amount of time that users in Egypt waited for I/O prefetch?
- How many rows were returned within a given time period for New Zealand users?
- What was the total amount of CPU time for users in Kenya?
To access these metrics, you create a workload definition and then grant use of the workload. Figure 5 shows a sample created workload named HONG_KONG_ ANALYTICS_WORKLOAD for users who are executing an application named ‘Market_Basket_Analysis’ and who have been granted the role ‘HONG_KONG_ROLE’:
SESSION USER_ROLE (‘HONG_KONG_ROLE’)
GRANT USAGE ON WORKLOAD HONG_KONG_ MARKETING_WORKLOAD
TO ROLE HONG_KONG_ROLE
Figure 5: Workload definition for Hong Kong Marketing
After the execution of these two statements (and assuming workload definition selection), metrics will become available for Hong Kong users executing the marketing application.
Identifying workload metrics for ETL
The same workload metrics for business applications can be provided for ETL. For example, you could define an ETL workload for each country’s ETL. In addition, by identifying each ETL job uniquely, you could return the same metrics listed in Figure 4 not only by country but also by ETL job.
Gathering workload metrics
Figure 6 lists the table functions for gathering workload metrics.
Providing operational and business perspectives
Workload metrics help remove the “noise” when it is necessary to report on business-related activity only. For example, Refresh Deferred MQT maintenance contributes to the metric “rows read” for any table queried as part of maintaining an MQT. With workload metrics, you can filter out such activity to get a better approximation of “business query” activity against a table.
A multi-temperature data warehouse solution is based on business requirements, workload management, and intelligent storage provisioning. As shown here, users have a variety of reporting capabilities and metrics available to help understand the utilization of tables, table spaces, and workloads as they relate to the frequency of access and change of data.