Big Data and Warehousing

Is Your Big Data Hot, Warm, or Cold?

Understanding frequency of data access and change in DB2

A multi-temperature data management solution refers in part to placing frequently accessed data on the fastest storage (hot data), less frequently accessed data on somewhat slower storage (warm data), and rarely accessed data on the slowest storage an organization has (cold data). Developing this solution requires a set of key performance indicators (KPIs) to take your data’s “temperature” and to help make both operational and business decisions involving your data. To demonstrate this, assume that your data is segmented and managed by age as follows (a partitioning sketch that could implement such a scheme appears after the list):

  • Data that is 31 days old or less
  • Data between 32 and 90 days old
  • Data between 91 and 180 days old
  • Data between 181 and 365 days old
  • Data more than 365 days old
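
As a minimal sketch of how such a segmentation could be implemented (the table, column, table space names, and date boundaries are illustrative, and the table spaces are assumed to already exist on hot, warm, and cold storage), a range-partitioned table lets each age band live in its own table space:

-- Illustrative DDL: names and boundaries are placeholders
CREATE TABLE SALES.TRANSACTIONS (
    TXN_ID   BIGINT        NOT NULL,
    TXN_DATE DATE          NOT NULL,
    AMOUNT   DECIMAL(12,2)
)
PARTITION BY RANGE (TXN_DATE)
(
    -- ranges must be listed in ascending order, so the oldest (cold) data comes first
    PARTITION P_COLD STARTING '2010-01-01' ENDING '2011-12-31' IN TBSP_COLD,
    PARTITION P_WARM STARTING '2012-01-01' ENDING '2012-09-30' IN TBSP_WARM,
    PARTITION P_HOT  STARTING '2012-10-01' ENDING '2012-12-31' IN TBSP_HOT
)

In practice you would typically create one partition per month, placing each month in its own table space, so that a month’s data can later be moved to slower storage (for example, by altering the table space’s storage group) as it ages.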

While it may be reasonable to assume that two-year-old data is accessed less frequently than data that is 90 days old, understanding the frequency of access and change in greater detail can trigger other business decisions. For example, if a month that has seen no changes for the past six months had 100 rows changed in an extract, transform, and load (ETL) cycle, you may not want to take any action. Alternatively, if 10,000 rows changed for that same month in an ETL cycle, you might consider one or more of the following actions (a scripted example of one such action follows the list):

  • Execute some form of sampling to determine if further analysis is required
  • Re-execute certain reports
  • Investigate the ETL process to understand why there was such a significant change
  • Maintain summary tables and materialized query tables (MQTs) that are affected
  • Use IBM DB2 High Performance Unload on the affected data or the entire table
  • Perform table space backups
  • Reorganize the data (or indexes only)
  • Execute RUNSTATS to refresh statistics
  • Perform some form of storage management or archiving
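
Several of these actions can be driven from SQL itself, which makes it easy to script them against whatever tables the change metrics flag. As a minimal sketch (the table name is a placeholder), RUNSTATS can be invoked through the ADMIN_CMD stored procedure:

-- Refresh statistics on the affected table (table name is illustrative)
CALL SYSPROC.ADMIN_CMD(
    'RUNSTATS ON TABLE SALES.TRANSACTIONS WITH DISTRIBUTION AND SAMPLED DETAILED INDEXES ALL'
)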

Identifying what data changed, along with the volume and frequency of change, can provide valuable input for operational and business decisions. This article shares some of the available metrics that can help you understand the frequency, volume, percentage of change, and type of actions that constitute changes to your data.

 

Developing key performance indicators

Figure 1 shows a bar graph representation of a table where the months December, November, October, September, August, and July have a higher rate of change and are more heavily accessed than data for the months June, May, April, March, February, and January.

Figure 1: Frequency of access versus frequency of change

 

Understanding frequency of access, rate of change, and other useful metrics

When the database is activated, DB2’s “always-on” monitoring metrics allow quick and easy reporting, which can then be used to develop a business view of data access patterns and changed-data activity. These metrics accumulate from the time the database was last activated and can be stored in user-defined tables for further analysis.

 

Table metrics

Figure 2 lists the key metrics that are available using the table function MON_GET_TABLE for each table and each range partition of a table:

  • Number of times a table or range partition was accessed
  • Number of rows read (table or range partition)
  • Number of rows inserted (table or range partition)
  • Number of rows updated (table or range partition)
  • Number of rows deleted (table or range partition)
  • Number of row updates that resulted in no changes to any of the column values of a row (table or range partition)
  • Table space in which table or range partition resides
Figure 2: Table and range partition activity metrics
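
As a sketch of how these metrics can be retrieved (the schema and table names are placeholders, and a member value of -2 requests data from all database members), a query along the following lines reports activity for each range partition of a table since the database was last activated:

-- Per-partition activity counters accumulated since database activation
SELECT TABSCHEMA,
       TABNAME,
       DATA_PARTITION_ID,
       TABLE_SCANS,
       ROWS_READ,
       ROWS_INSERTED,
       ROWS_UPDATED,
       ROWS_DELETED
FROM TABLE(MON_GET_TABLE('SALES', 'TRANSACTIONS', -2)) AS T
ORDER BY DATA_PARTITION_ID

Capturing the result of such a query in a user-defined table, together with a timestamp, is what makes the period-over-period questions below answerable.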

 

These metrics enable you to answer the following questions:

  • How many rows changed, and what was the rate of change, for a given time period (by issuing the table function periodically and storing the results; see the sketch after this list)?
  • How many “newly arrived” rows were processed in a given week?
  • How many update statements were executed that did not result in an update actually occurring (item 6 from Figure 2)?
  • What was the total number of rows updated in a table space?
  • How many rows were deleted in a particular time period?
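
For the time-period questions, one approach is to difference two stored snapshots. The following is a sketch only: TAB_ACTIVITY_HIST is a hypothetical user-defined table populated periodically from the query above with an added CAPTURE_TS timestamp column, the two timestamps are placeholders, and it assumes the database stayed activated between the snapshots (the counters reset on activation):

-- Rows inserted, updated, and deleted per table between two snapshots
SELECT cur.TABSCHEMA,
       cur.TABNAME,
       SUM(cur.ROWS_INSERTED - prev.ROWS_INSERTED) AS ROWS_INSERTED_IN_PERIOD,
       SUM(cur.ROWS_UPDATED  - prev.ROWS_UPDATED)  AS ROWS_UPDATED_IN_PERIOD,
       SUM(cur.ROWS_DELETED  - prev.ROWS_DELETED)  AS ROWS_DELETED_IN_PERIOD
FROM TAB_ACTIVITY_HIST cur
JOIN TAB_ACTIVITY_HIST prev
  ON  prev.TABSCHEMA         = cur.TABSCHEMA
  AND prev.TABNAME           = cur.TABNAME
  AND prev.DATA_PARTITION_ID = cur.DATA_PARTITION_ID
WHERE cur.CAPTURE_TS  = TIMESTAMP('2012-12-31-00.00.00')
  AND prev.CAPTURE_TS = TIMESTAMP('2012-12-01-00.00.00')
GROUP BY cur.TABSCHEMA, cur.TABNAME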

 

Index metrics

While index metrics do not feed directly into data temperature, they help complete the picture of your data by showing index utilization and index performance through the table function MON_GET_INDEX. A subset of those metrics is listed in Figure 3 (a sample query follows the figure):

  • Number of index only scans
  • Number of index access scans
  • Number of key column updates
  • Number of include column updates
  • Number of index jump scans
  • Number of page splits
Figure 3: Subset of index utilization metrics
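
As a sketch (the schema and table names are placeholders, and the join to SYSCAT.INDEXES simply resolves the index identifier to an index name), index utilization for a single table can be reported as follows; the columns shown correspond to a subset of the metrics in Figure 3:

-- Index utilization for one table, most heavily scanned indexes first
SELECT I.INDNAME,
       M.INDEX_SCANS,
       M.INDEX_ONLY_SCANS,
       M.KEY_UPDATES,
       M.INCLUDE_COL_UPDATES
FROM TABLE(MON_GET_INDEX('SALES', 'TRANSACTIONS', -2)) AS M
JOIN SYSCAT.INDEXES AS I
  ON  I.TABSCHEMA = M.TABSCHEMA
  AND I.TABNAME   = M.TABNAME
  AND I.IID       = M.IID
ORDER BY M.INDEX_SCANS DESC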

 

Identifying dormant data

Data that is not accessed for an extended period of time is often referred to as “dormant data”—for example:

  • Data that is kept only to satisfy legal requirements
  • Data that is no longer updated but is still being queried

The system catalog column named LASTUSED is asynchronously updated to store the last date that an object was used or referenced. This information is available for indexes, packages, tables, each range partition of a table, and MQTs. By capturing this information over time, you can identify how long any particular database object has not been accessed.

There are other scenarios in which the LASTUSED column also provides value, such as dropping indexes that are no longer accessed after a database migration. The following system catalog views contain the LASTUSED column (a sample query follows the list):

  • SYSCAT.DATAPARTITIONS
  • SYSCAT.INDEXES
  • SYSCAT.PACKAGES
  • SYSCAT.TABLES

 

Providing a business view of daily SQL activity

In the article “Going Global with Data Mart Consolidation—Part One: Using Row and Column Security to Give Local Views and Correct Currency Aggregation,” database roles were defined for each country referred to in the article as shown in Table 1:

 

Country                  Database Role
Argentina                ARGENTINA_ROLE
Australia                AUSTRALIA_ROLE
Brazil                   BRAZIL_ROLE
China                    CHINA_ROLE
Egypt                    EGYPT_ROLE
Hong Kong                HONG_KONG_ROLE
India                    INDIA_ROLE
Kenya                    KENYA_ROLE
Mexico                   MEXICO_ROLE
New Zealand              NEW_ZEALAND_ROLE
Singapore                SINGAPORE_ROLE
South Africa             SOUTH_AFRICA_ROLE
United Arab Emirates     UNITED_ARAB_EMIRATES_ROLE
Table 1: Country database roles

 

If you defined a workload for each role (country), you could provide many of the same metrics reported at the table level using the table function MON_GET_TABLE—but instead of being reported from an operational point of view, they would be reported from a business point of view. Figure 4 lists some of the key metrics that become immediately available using the table function MON_GET_WORKLOAD once a workload definition is created:

  • Total CPU time
  • Total CPU utilization
  • Total number of SQL statements executed
  • Rows read
  • Rows modified
  • Rows returned
  • Percentage of I/O wait time
  • Buffer pool hit ratio expressed as a percentage
  • Average number of rows read from the table per row returned to the application
Figure 4: Subset of metrics available using workload definitions

 

These metrics enable you to answer business questions such as the following:

  • How many rows did users from Brazil read? How many rows were returned?
  • What percentage of I/O wait time was experienced by users from China after tables were moved from one storage group to another?
  • What was the buffer pool hit ratio for users in India?
  • What was the total amount of time that users in Egypt waited for I/O prefetch?
  • How many rows were returned within a given time period for New Zealand users?
  • What was the total amount of CPU time for users in Kenya?

 

To access these metrics, you create a workload definition and then grant use of the workload. Figure 5 shows a sample workload definition named HONG_KONG_ANALYTICS_WORKLOAD for users who are executing an application named ‘Market_Basket_Analysis’ and who have been granted the role ‘HONG_KONG_ROLE’:

 

CREATE WORKLOAD HONG_KONG_ANALYTICS_WORKLOAD
  APPLNAME ('Market_Basket_Analysis')
  SESSION_USER ROLE ('HONG_KONG_ROLE')

GRANT USAGE ON WORKLOAD HONG_KONG_ANALYTICS_WORKLOAD
  TO ROLE HONG_KONG_ROLE

Figure 5: Workload definition for Hong Kong Marketing

 

After these two statements execute (and assuming incoming connections are assigned to the new workload through its connection attributes), metrics become available for Hong Kong users executing the marketing application.
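
As a sketch of retrieving those metrics (a member value of -2 requests all database members; passing NULL instead of a workload name returns every workload), a query such as the following reports the business-level counters for the new workload:

-- Business-view counters for the Hong Kong marketing workload
SELECT WORKLOAD_NAME,
       TOTAL_CPU_TIME,
       ROWS_READ,
       ROWS_MODIFIED,
       ROWS_RETURNED
FROM TABLE(MON_GET_WORKLOAD('HONG_KONG_ANALYTICS_WORKLOAD', -2)) AS W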

 

Identifying workload metrics for ETL

The same workload metrics that are available for business applications can also be provided for ETL. For example, you could define an ETL workload for each country’s ETL. In addition, by identifying each ETL job uniquely, you could return the same metrics listed in Figure 4 not only by country but also by ETL job.
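
As a sketch (the workload name and client application name are placeholders, and it assumes each ETL job sets its client application name on its connection), a per-country ETL workload keyed on that attribute might look like this:

CREATE WORKLOAD BRAZIL_ETL_WORKLOAD
  CURRENT CLIENT_APPLNAME ('Brazil_Nightly_ETL')
  SESSION_USER ROLE ('BRAZIL_ROLE')

GRANT USAGE ON WORKLOAD BRAZIL_ETL_WORKLOAD
  TO ROLE BRAZIL_ROLE

The same MON_GET_WORKLOAD query shown earlier then reports rows read, rows modified, and CPU time for that country’s ETL alone.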

 

Gathering workload metrics

Figure 6 lists the table functions for gathering workload metrics.

  • MON_GET_WORKLOAD
  • MON_GET_WORKLOAD_SUMMARY
  • MON_SAMPLE_WORKLOAD_METRICS
Figure 6: Table functions for reporting workload metrics

Providing operational and business perspectives

Workload metrics help remove the “noise” when it is necessary to report on business-related activity only. For example, Refresh Deferred MQT maintenance contributes to the metric “rows read” for any table queried as part of maintaining an MQT. With workload metrics, you can filter out such activity to get a better approximation of “business query” activity against a table.

 

Conclusion

A multi-temperature data warehouse solution is based on business requirements, workload management, and intelligent storage provisioning. As shown here, users have a variety of reporting capabilities and metrics available to help understand the utilization of tables, table spaces, and workloads as they relate to the frequency of access and change of data.


Dan Gibson

From his position as the DB2 Lab Advocate at the Atlanta, Nagano, and Sydney Olympic Games to his Master of Science Degree in Computer Science, Dan knows technology—specifically DB2—inside and out. Since joining IBM over a decade ago, Dan has dedicated his career to diving in with DB2.