The preponderance of big data and the capability to effectively collect, organize, and synthesize key data into information for analysis and query has managed to bring unprecedented challenges on existing relational database management systems. A critical need for many line-of-business users, data scientists, and other analysts is the speed at which vital data can be readily stored, accessed, and segregated from irrelevant data. Compressing the data stored on disk was an important milestone, but there were still storage and speed-of-access limitations in the way traditional relational databases could handle big data and its analysis.
UNIX-based relational databases, in particular, basically all did the same thing by taking tuples—rows—and making sure they were consistently written to disk on a disk page for 100 percent retrieval using SQL, the language of choice. Over time, memory was used to cache data in disk page quantities for faster access than was possible from disk. Then computing power accelerated with advances such as symmetric multiprocessing (SMP), multi-threaded operating systems, multi-threaded databases, clustering these advanced systems, and virtualization.
Within databases, cost-based optimizers, join algorithms, enhancements to the traditional B-tree index, object relational extensions for new data types and access methods, security, multitenancy, and cloud-based computing evolved. Simultaneously, traditional relational databases were characterized by a row-based structure, disk pages that are pulled into memory to be cached, and indexes that are required to accelerate access. In addition, relational databases can also be characterized by the substantial bytes that go to waste on the disk and in cache because it has always been about page size. And the cost/gigabyte of storage and RAM keeps falling, which helps eliminate any concern about inefficient or unused space on a page.
Data compression was first introduced to pack or shrink down disk pages. Algorithms progressed to optimize compression for different data file types. At this level of compression, performance for relational databases is currently highly dependent on the implementations specific vendors provide.
IBM® DB2® and IBM Informix® databases are designed to offer a performance gain because the comparatively smaller amount of computational overhead to maintain data compression is more cost-effective than uncompressing the data. In addition, database administrator (DBA) overhead for performing backups and restores is vastly efficient for these database environments because these operations utilize compressed disk pages.
Optimizing the ecosystem
In DB2 with BLU Acceleration, saving storage is not just about compression from the traditional disk-based perspective. Instead, its compression is about providing an ecosystem of optimization that brings an advanced set of capabilities to take advantage of new algorithms, access methods, and advanced processor capabilities as the core of the architecture. The question of whether data is compressed no longer exists; in the DB2 with BLU Acceleration ecosystem, data is compressed. And this compression is based on the sum of four parts that include storage and access to columnar data, moving key columnar data to memory, skipping irrelevant data, and optimizing processor registers.
Columnar data storage and access
Row-based data storage is fast for transaction processing: log the change, commit the change to disk, use indexes to facilitate very fast access to specific data needed for end-user requests, and store the change.
For analytic processing and questions that require a large number of table joins and aggregations, and generally need access to a small percentage of the columns, row-based storage hits a performance wall. Columnar data storage flips the data 90 degrees and stores columns, not rows, together (see Figure 1). As a result, the less data is different, the more it can be compressed by using very efficient algorithms because the data also exists on the same data page. For example, consider how much address-centric data in a US state field can be compressed. Because there are only 52 possible data elements—including the District of Columbia (DC) and Puerto Rico (PR) —across millions or even billions of rows, the opportunity to heavily compress this data is significant.
Columnar storage also removes the need for indexes. In row-based storage, storing indexes can easily consume as much storage as the data itself—if not more. Column-based storage frees all the storage space that would otherwise be necessary for indexes. Columnar data is also inherently fast to access because data records that are the same are stored right next to each other, so building an index for this structure does not help accelerate performance. Similarly, there is no need for indexing with aggregations, materialized query tables (MQTs), or other artificial ways of pre-calculating results into fewer rows depending on the SQL.
Figure 1: Enhancing compression with columnar data storage
Columnar in-memory storage and access
With the data compressed in a columnar format on the storage subsystem, the data that is crucial for analysis or answering questions can be transferred to RAM to help accelerate its access. Consider the comparison of seek times. For example, if memory seek time is 8 ns, that seek time is the same amount of time for light to travel 8 feet in a vacuum. Similarly, if disk seek time averages 4 ms, that seek time is the same amount of time for light to travel 745 miles in a vacuum. Memory seek time in this example is 491,700 times faster than disk seek time. In addition, whereas memory cost is approximately USD40,000/TB, the cost for a spinning disk is approximately USD800/TB,1 which means memory is approximately 50 times more costly.
As a result, compression and data layout are maintained as the data moves into RAM (see Figure 2). Most importantly, BLU Acceleration executes the highly expensive work performed with the compressed data for functions like equality predicates, range predicates, grouping and aggregation, and joins. Unlike other in-memory database technologies, BLU Acceleration is designed to maintain in-memory performance, intelligent caching, and efficient use of workspace in system memory for each end user.
Figure 2: Maintaining compression for processor-based workloads
Row-based data caches have become very good at keeping only the important disk pages in memory. Data skipping in BLU Acceleration helps provide a similar function for columnar data. This capability enables avoiding unnecessary processing of irrelevant data and keeping in memory only the data that is actually used for analysis.
Processor register optimization
The reduced amount of data when it is compressed enables increased throughput when the data passes through the processor—and thus used in the processor’s cache. Uncompressed, row-based data, in contrast, remains very fat and does not fit into the processor registers as well as compressed data. Column-based, compressed data and the work performed using the compressed data provide opportunities to take advantage of this highly important feature for analyzing big data. And the essence of this advance is the capability to reduce a data value to mere bits optimized for processor caching and throughput.
Combining storage and access for specific processing
Compression did not evolve merely to fill memory and persistent storage performance gaps. It now allows for bringing together the right storage and access methods to data based on what makes sense for the way in which the data will be processed. DB2 with BLU Acceleration provides the following benefits that result from the sum of compression parts at each level in a machine’s subsystems:
- Disk savings: By requiring no indexes and no pre-calculated elements, the ability to compress stored columnar data enables 10 to 15 times less gross storage to be consumed by the database than traditional, row-based data storage. For example, 1 TB of raw data used to require 3 TB of storage on disk. Columnar data storage now requires a total of 180 GB, which is easily a 15 times savings in disk space.
- Memory savings: The data brought into memory stays compressed, and the space needed for each line-of-business user to execute work is greatly reduced. The overhead necessary to maintain the system drops as well. In lab testing performed July, 2013 at the IBM Competitive Project Office using DB2 with BLU Acceleration, the amount of total memory consumed by 80 concurrent end users against 1 TB of raw data was only 220 GB.2 That consumption was not only a result of the compressed data in memory, but it also was used for the working space that each end user required for utilizing that data.
- Processor throughput: Compressed workloads help make query plans easier to calculate and more predictable using columnar data than row-based data, and the amount of work required for the processor drops because it is processing compressed data. As a result, performance and concurrency against a growing data set are highly enhanced on reduced processor footprints.
And organizations can derive additional benefits. With these abilities, DB2 with BLU Acceleration can logically calculate and use compression to scan, join, aggregate, and sort even very comprehensive queries. These queries can be run against highly complex data definitions and vast numbers of concurrent end users, while rapidly returning cost-effective results for each question posed.
Please share any thoughts or questions you have in the comments.
1 IBM System x: Enhanced solutions to solve key business problems.
2 Based on testing performed July 2013 at the IBM Competitive Project Office. Actual results may vary and are dependent on many factors including hardware and software design and configuration.