By Leon Katsnelson
By Susan Visser
By Bernie Spang
By the DB2 Guys
By Fred Ho
By Louis T. Cherian
By Shweta Shandilya
By Lawrence Weber
By Serge Rielau
By Dwaine Snow
My last two columns discussed why and how we use dimensional modeling to organize the data in a data warehouse. To briefly reiterate, the goal of a data warehouse is to give users fast access to data so they can make better business decisions, and to give them a separate area to run queries and play with data so they don’t affect the operational database. The goal of dimensional modeling is to optimize warehouse data for easy queries and analysis by business users.
There are three key parts of a dimensional model: fact tables, dimension tables, and summary tables. We have already covered fact and dimension tables, so in this column, we will look at how to design summary tables.
Summary tables are an important part of creating a high-performance data warehouse. A summary table stores data that has been aggregated in a way that answers a common (or resource-intensive) business query. Summary tables are all about speed. They’re smaller than fact tables, which means they generally respond more quickly (fewer rows to query), and they deliver answers without calculating every result from scratch.
Most of the time the summary table contains the results of an SQL SUM function, but you can also use the COUNT, MAX, MIN, or AVERAGE functions as needed. The COUNT function is very useful to show at any level how many records make up each aggregated row in the summary table.
Ideally, the structure of the data warehouse offers a summary table for any question that business users want to ask. From the summary table, they can then pick the dimensions that they want to drill down into and see more detail. Users should be able to drill down from one summary table into another summary table, ultimately arriving at the fact table (if that’s the level of detail they need).
A well-designed data warehouse has a pyramid of summary tables, each one containing more detail until you arrive at the fact table, at the bottom of the pyramid. The summary table at the top of the pyramid is small, so queries on it are very fast. In fact, the goal is to have 90 percent of your queries running against the summary tables so the user gets very fast responses. As the user drills down into more detail, each summary table contains more rows, until the user reaches the fact table (see Figure 1).
Figure 1: Pyramid of summary tables
For example, when I built a financial data warehouse for a government agency many years ago, they had a fact table with millions of records containing every debit and credit transaction in their financial system for several years. A single payment could have had 6 to 12 debit and credit transactions. However, most users did not want to see this level of detail; they needed to see only the income and expense results of these transactions.
We built a summary table that aggregated the income and expense detail to show the end result for each transaction. This table still contained millions of records, so we built a summary table to aggregate the detail at a monthly level, a table to aggregate the detail at a year-to-date level, and a final summary table for the division level. This pre-stored, aggregated data allowed agency employees to see results very quickly at the level they needed. When they found issues that required more detail, they could then drill down into another summary table or the detail fact table.
As you might guess, summary tables are the starting point for most canned reports and dashboards. Summary tables are like data cubes, and a cube is basically a separate software product or technology that creates summary tables. Hence, you can do everything with summary tables that you do with cubes.
One way to build summary tables is to put an SQL trigger on the fact table, so every time a record gets inserted into the fact table, the trigger updates the summary table. That way you never have to worry about building the summary table again and again. Another way is to have the extract, transform, and load (ETL) process update the summary table as it loads the fact table. A third method is to have the summary tables rebuilt after the ETL process using SQL SUM and aggregate functions. Higher-level summary tables in a pyramid of tables may be built off lower-level tables, rather than off of the fact table.
A key design decision is the grain of aggregate tables. When you start with a fact table that has billions and billions of records, you need to plan summary tables in a hierarchy that summarizes the detail with the goal of reducing the record count by a factor of 100,000 to 1,000,000. Your top-level summary table may have about 100,000 records; the next-level summary table will probably have millions of records. Finally, your detailed fact table will have billions of records.
Another benefit of summary tables is that they let you compare results from two different fact tables (maybe even in separate data marts) that have different grains. Matching up grain levels when building comparisons is a critical concept in designing a data warehouse. One project I developed had financial data at the detail grain transaction level and program accomplishments at a day grain level. It was not possible to compare dollars spent with accomplishments without summarizing the dollars to the same grain as that of the program level. We built one summary table that contained dollars and program accomplishments at the same level, so users could see what it cost per program accomplishment and which offices were the most productive for the lowest costs.
This is how your users are going to make decisions—by comparing facts from different fact tables with different grains. No matter what data your users want to compare—budgeted expenses to actual, planned income to actual, projected sales to actual, or sales to inventory—the data must be summarized to the same grain so the results are meaningful. If you don’t summarize data to the same level of detail, users run the risk of reporting on incorrect data or drawing incorrect conclusions from the data.
Designing a dimensional data warehouse is a complex subject, and I have just touched the surface of it in these three articles. I would like to share some other resources that offer more information (see sidebar, “Resources”). The first is from IBM Redbooks, and I am really excited to see it published: Data Warehousing with the Informix Dynamic Server. You can download it from the IBM Redbooks Web site.
A second resource I have found helpful is a manual titled Designing and Implementing a Database, available in the IBM Informix Dynamic Server11.5 Information Center and as a separate download. The entire Section 4 is on designing dimensional databases, and the manual offers a very good overview of dimensional data modeling and implementing a dimensional database design with Informix.
I also did a series of Webcasts earlier this year on this subject, available on the Advanced DataTools Web site. Happy designing!
DB2 TechTalk: Deep Dive on BLU Acceleration in DB2 10.5, Super Analytics Super Easy
Thursday, May 30: 12:30 – 2:00 PM ET
Informix Chat with the Lab: Primary Storage Manager (PSM) a Parallel Backup Alternative to Ontape
Thursday, May 30: 11:30 – 1 PM ET
Big Data Seminar 2013, Featuring Krish Krishnan
June 14 in New York City
marcus evans Pharma Data Analytics Conference
July 10-11 in Philadelphia
IBM Smarter Content Summit 2013
Big Data at the Speed of Business
Broadcast event replay now available
Information on Demand 2013: Early Bird Registration Now Open
November 3-7 in Las Vegas