How to Improve Performance in Warehouse Systems Using Replicated MQTs
In any Database Partitioning Feature (DPF) warehouse environment, replicated Materialized Query Tables (MQTs) play an important role in improving database performance. The improvement can be significant for SQL statements that frequently join bulky fact tables with small dimension tables: the replicated MQT assists those joins by presenting the dimension data locally on each partition instead of broadcasting it across all the data partitions.
One common approach to avoid any data broadcast is to replicate the dimension tables across the data partitions. This approach enables co-located joins instead of non-co-located joins. A co-located join can occur only when certain criteria are met:
- Both joining tables are in the same database partition group.
- The distribution keys for both tables have the same number of columns and compatible data types.
- The distribution key columns are partition-compatible.
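As a sketch of what meeting these criteria looks like (the table, column, and tablespace names here are illustrative, not from the environment described below), two tables hashed on a compatible single-column key in the same database partition group can be joined co-located:

db2 +v "CREATE TABLE DATAMARTS.FACT_SALES (CUST_ID INTEGER NOT NULL, AMOUNT DECIMAL(15,2))
IN TBS_FACT_DATA DISTRIBUTE BY HASH (CUST_ID)"
db2 +v "CREATE TABLE DATAMARTS.DIM_CUST (CUST_ID INTEGER NOT NULL, CUST_NAME VARCHAR(100))
IN TBS_FACT_DATA DISTRIBUTE BY HASH (CUST_ID)"

Because both tables live in the same partition group and hash on a single INTEGER column, a join on CUST_ID can be resolved locally on each partition.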
Let’s look at an example of how to convert a non-co-located join to a co-located one, and how that conversion improves performance.
Enhancing performance with a few easy steps
For this example, we used the following environment:
- IBM InfoSphere Balanced Warehouse D5100 ran IBM DB2 9.7 FP5 on SUSE Linux Enterprise Server (SLES) 10.
- 11 physical servers, each equipped with four 2800 MHz CPUs and 32 GB of memory.
- The database included 41 data partitions (40 data partitions and one coordinator partition).
- The fact table included 508 million records spread over 40 data partitions.
- The dimension table included 111,146 records, residing on the single coordinator (admin node) partition.
- The database configuration parameter DFT_QUERYOPT (default query optimization class) was set to 2.
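For reference, the optimization class can be set and verified at the database level like this (the database name BCUDB is illustrative):

db2 +v "UPDATE DB CFG FOR BCUDB USING DFT_QUERYOPT 2"
db2 +v "GET DB CFG FOR BCUDB" | grep -i QUERYOPT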
Consider a query that joins the fact table with the dimension table (only a fragment of the statement is shown here):
FROM DATAMARTS.FACT_CUST_PROFTBLTY F,
Figure 1. Data access plan when the MQT is not available in the database
All the data from the dimension table DIM_CUSTOMER were sent to all the data partitions through a broadcast table queue (BTQ, shown as Q1), with no hashing to determine the receiving partitions (see Figure 1).
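One way to inspect an access plan like this is the db2expln tool (the statement text is abbreviated here; substitute your own query and database name):

db2expln -database BCUDB -statement "SELECT ... FROM DATAMARTS.FACT_CUST_PROFTBLTY F, DATAMARTS.DIM_CUSTOMER D WHERE ..." -terminal -graph

The -graph option prints the operator tree, where table queue operators such as BTQ are visible.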
With this setup, changing the row relocation from a BTQ to a simple co-located join involves just a few easy steps. First, create table spaces TBS_REP_DATA and TBS_REP_INDEX in the same database partition group as the fact table.
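Assuming the fact table resides in a partition group named PDPG (the name is illustrative), those tablespaces might be created as follows:

db2 +v "CREATE TABLESPACE TBS_REP_DATA IN DATABASE PARTITION GROUP PDPG MANAGED BY AUTOMATIC STORAGE"
db2 +v "CREATE TABLESPACE TBS_REP_INDEX IN DATABASE PARTITION GROUP PDPG MANAGED BY AUTOMATIC STORAGE"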
Next, replicate the table—make sure you have a unique index present on the table before replicating.
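If the base table does not already have a unique index, one could be added first (the key column CUST_ID is illustrative):

db2 +v "CREATE UNIQUE INDEX DATAMARTS.UX_DIM_CUSTOMER ON DATAMARTS.DIM_CUSTOMER (CUST_ID)"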
db2 +v "CREATE TABLE DATAMARTS.R_DIM_CUSTOMER AS
(SELECT * FROM DATAMARTS.DIM_CUSTOMER)
DATA INITIALLY DEFERRED
REFRESH IMMEDIATE IN TBS_REP_DATA
INDEX IN TBS_REP_INDEX REPLICATED"
db2 +v "REFRESH TABLE DATAMARTS.R_DIM_CUSTOMER"
Then create any other custom indexes manually, since they are not created automatically. Note that you cannot create a unique index or constraints on a replicated table; all constraints should be defined on the base table. Finally, perform REORG and RUNSTATS on the replicated table.
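These post-replication steps might look like the following (the index name and column are illustrative):

db2 +v "CREATE INDEX DATAMARTS.IX_R_DIM_CUST ON DATAMARTS.R_DIM_CUSTOMER (CUST_ID)"
db2 +v "REORG TABLE DATAMARTS.R_DIM_CUSTOMER"
db2 +v "RUNSTATS ON TABLE DATAMARTS.R_DIM_CUSTOMER WITH DISTRIBUTION AND DETAILED INDEXES ALL"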
Once the replicated table is created, the explain plan shows that the BTQ has been eliminated and replaced with a local, co-located join.
Figure 2. Data access plan when the MQT is available in the database
The total execution time was 23.8 times faster with the MQT, an improvement of 95.8 percent (1 − 1/23.8 ≈ 0.958).
Table 1. Execution time and total query cost results with and without MQT
| Case | Total Query Cost (Timerons) | Execution Time (Seconds) | Improvement |
Boosting performance further with an IBM customer
These results have been replicated with IBM customers. Reckitt Benckiser, a UK-based consumer goods company, was looking for a way to optimize its data access plans on top of replication. With help from IBM, the company conducted a proof of concept that demonstrated a significant performance improvement with the co-location approach compared with the earlier BTQ-based plans.
In assessing the existing database design, the IBM team identified several issues:
- There were 77 bulky fact tables created in a single database partition group, spread across 40 data partitions. The replicated tables were also in the same database partition group.
- Because there could be only one distribution map, it was difficult to keep data and processing skew consistently minimal across all the fact tables.
During the exercise, the IBM team identified five fact tables that were critical for business reporting and shared similar, frequently joined columns.
With just a few steps, the IBM team achieved another three percent performance gain over the lab results reported above:
- The team created a separate database partition group and a tablespace for this very important set of tables (five of them in this case).
- The team then copied the data into the new tablespace and renamed the old tables. This step reduced the data skew from 12 percent to less than 3 percent. It introduced a limitation, however, involving the replicated MQT DATAMARTS.R_DIM_CUSTOMER: because the fact tables now resided in a different database partition group, the DB2 optimizer could no longer consider that MQT for a co-located join.
- To overcome the limitation, the IBM team created one more set of tablespaces in the new database partition group and built a second replicated MQT on top of them.
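The new partition group and its tablespaces could be sketched as follows (the group name and partition numbers are illustrative):

db2 +v "CREATE DATABASE PARTITION GROUP PDPG_FACT5 ON DBPARTITIONNUMS (1 TO 40)"
db2 +v "CREATE TABLESPACE TBS_REP2_DATA IN DATABASE PARTITION GROUP PDPG_FACT5 MANAGED BY AUTOMATIC STORAGE"
db2 +v "CREATE TABLESPACE TBS_REP2_INDEX IN DATABASE PARTITION GROUP PDPG_FACT5 MANAGED BY AUTOMATIC STORAGE"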
db2 +v "CREATE TABLE DATAMARTS.R2_DIM_CUSTOMER AS
(SELECT * FROM DATAMARTS.DIM_CUSTOMER)
DATA INITIALLY DEFERRED
REFRESH IMMEDIATE IN TBS_REP2_DATA
INDEX IN TBS_REP2_INDEX REPLICATED"
db2 +v "REFRESH TABLE DATAMARTS.R2_DIM_CUSTOMER"
Table 2. Execution time results with and without MQT
| Case | Execution Time (Seconds) | Improvement |
The DB2 DPF environment is a strong scale-out solution for database warehouse systems. As the examples above demonstrate, the MQT approach can be very useful in improving the performance of business reporting systems and ultimately delivering a better end-user experience.
Any questions? Let me know if you have any challenges with this approach.