The preponderance of big data and the capability to effectively collect, organize, and synthesize key data into information for analysis and query has managed to bring unprecedented challenges on existing relational database management systems. A critical need for many line-of-business users, data scientists, and other analysts is the speed at which vital data can be readily stored, accessed, and segregated from irrelevant data. Compressing the data stored on disk was an important milestone, but there were still storage and speed-of-access limitations in the way traditional relational databases could handle big data and its analysis.
UNIX-based relational databases, in particular, basically all did the same thing by taking tuples—rows—and making sure they were consistently written to disk on a disk page for 100 percent retrieval using SQL, the language of choice. Over time, memory was used to cache data in disk page quantities for faster access than was possible from disk. Then computing power accelerated with advances such as symmetric multiprocessing (SMP), multi-threaded operating systems, multi-threaded databases, clustering these advanced systems, and virtualization.
Within databases, cost-based optimizers, join algorithms, enhancements to the traditional B-tree index, object relational extensions for new data types and access methods, security, multitenancy, and cloud-based computing evolved. Simultaneously, traditional relational databases were characterized by a row-based structure, disk pages that are pulled into memory to be cached, and indexes that are required to accelerate access. In addition, relational databases can also be characterized by the substantial bytes that go to waste on the disk and in cache because it has always been about page size. And the cost/gigabyte of storage and RAM keeps falling, which helps eliminate any concern about inefficient or unused space on a page.
Data compression was first introduced to pack or shrink down disk pages. Algorithms progressed to optimize compression for different data file types. At this level of compression, performance for relational databases is currently highly dependent on the implementations specific vendors provide.
IBM® DB2® and IBM Informix® databases are designed to offer a performance gain because the comparatively smaller amount of computational overhead to maintain data compression is more cost-effective than uncompressing the data. In addition, database administrator (DBA) overhead for performing backups and restores is vastly efficient for these database environments because these operations utilize compressed disk pages.
In DB2 with BLU Acceleration, saving storage is not just about compression from the traditional disk-based perspective. Instead, its compression is about providing an ecosystem of optimization that brings an advanced set of capabilities to take advantage of new algorithms, access methods, and advanced processor capabilities as the core of the architecture. The question of whether data is compressed no longer exists; in the DB2 with BLU Acceleration ecosystem, data is compressed. And this compression is based on the sum of four parts that include storage and access to columnar data, moving key columnar data to memory, skipping irrelevant data, and optimizing processor registers.
Row-based data storage is fast for transaction processing: log the change, commit the change to disk, use indexes to facilitate very fast access to specific data needed for end-user requests, and store the change.
For analytic processing and questions that require a large number of table joins and aggregations, and generally need access to a small percentage of the columns, row-based storage hits a performance wall. Columnar data storage flips the data 90 degrees and stores columns, not rows, together (see Figure 1). As a result, the less data is different, the more it can be compressed by using very efficient algorithms because the data also exists on the same data page. For example, consider how much address-centric data in a US state field can be compressed. Because there are only 52 possible data elements—including the District of Columbia (DC) and Puerto Rico (PR) —across millions or even billions of rows, the opportunity to heavily compress this data is significant.
Columnar storage also removes the need for indexes. In row-based storage, storing indexes can easily consume as much storage as the data itself—if not more. Column-based storage frees all the storage space that would otherwise be necessary for indexes. Columnar data is also inherently fast to access because data records that are the same are stored right next to each other, so building an index for this structure does not help accelerate performance. Similarly, there is no need for indexing with aggregations, materialized query tables (MQTs), or other artificial ways of pre-calculating results into fewer rows depending on the SQL.
Figure 1: Enhancing compression with columnar data storage
With the data compressed in a columnar format on the storage subsystem, the data that is crucial for analysis or answering questions can be transferred to RAM to help accelerate its access. Consider the comparison of seek times. For example, if memory seek time is 8 ns, that seek time is the same amount of time for light to travel 8 feet in a vacuum. Similarly, if disk seek time averages 4 ms, that seek time is the same amount of time for light to travel 745 miles in a vacuum. Memory seek time in this example is 491,700 times faster than disk seek time. In addition, whereas memory cost is approximately USD40,000/TB, the cost for a spinning disk is approximately USD800/TB,1 which means memory is approximately 50 times more costly.
As a result, compression and data layout are maintained as the data moves into RAM (see Figure 2). Most importantly, BLU Acceleration executes the highly expensive work performed with the compressed data for functions like equality predicates, range predicates, grouping and aggregation, and joins. Unlike other in-memory database technologies, BLU Acceleration is designed to maintain in-memory performance, intelligent caching, and efficient use of workspace in system memory for each end user.
Figure 2: Maintaining compression for processor-based workloads
Row-based data caches have become very good at keeping only the important disk pages in memory. Data skipping in BLU Acceleration helps provide a similar function for columnar data. This capability enables avoiding unnecessary processing of irrelevant data and keeping in memory only the data that is actually used for analysis.
The reduced amount of data when it is compressed enables increased throughput when the data passes through the processor—and thus used in the processor’s cache. Uncompressed, row-based data, in contrast, remains very fat and does not fit into the processor registers as well as compressed data. Column-based, compressed data and the work performed using the compressed data provide opportunities to take advantage of this highly important feature for analyzing big data. And the essence of this advance is the capability to reduce a data value to mere bits optimized for processor caching and throughput.
Compression did not evolve merely to fill memory and persistent storage performance gaps. It now allows for bringing together the right storage and access methods to data based on what makes sense for the way in which the data will be processed. DB2 with BLU Acceleration provides the following benefits that result from the sum of compression parts at each level in a machine’s subsystems:
And organizations can derive additional benefits. With these abilities, DB2 with BLU Acceleration can logically calculate and use compression to scan, join, aggregate, and sort even very comprehensive queries. These queries can be run against highly complex data definitions and vast numbers of concurrent end users, while rapidly returning cost-effective results for each question posed.
Please share any thoughts or questions you have in the comments.
1 IBM System x: Enhanced solutions to solve key business problems.
2 Based on testing performed July 2013 at the IBM Competitive Project Office. Actual results may vary and are dependent on many factors including hardware and software design and configuration.
IBM big data in a minute: Bringing the power of Hadoop to the enterprise
Video: The right tool for the job
Nature of analytics video: IBM and the swan of all fears
IBM redesigns its Big Data & Analytics website with IBM Watson Foundations capabilities
Visit a website with comprehensive resources dedicated to the chief data officer role
Podcast: Learn about the InfoSphere Streams project at GitHub