To most IT professionals, the term big data conjures up visions of hundreds of terabytes of data flowing through a network with the speed of a whirlwind while users attempt to execute complex queries with minimal elapsed time. This vision of big data as “volume, velocity, and variety” is an important one: it allows IT infrastructure management to break down the idea into parts that are more easily managed.
Database administration is an important component of the IT infrastructure. The DBA is responsible for data backup, recoverability, and performance tuning. A big data environment requires the DBA to look at things a bit differently than with more traditional systems.
The first aspect of big data is… lots of data! Large databases by themselves are a challenge to manage. However, DBAs must also consider data movement and data retrieval requirements.
For the DBA, the big data environment means that there is no time (and usually not enough disk or magnetic tape) for database reorgs or database backups. This also means that the databases themselves must be defined with these considerations in mind:
The resources used for data movement are related to an environment’s maturity. As the big data environment is defined, the emphasis is on data retrieval. This is because most big data applications must show a payback early on in order to be judged useful (or profitable). As the environment grows, the emphasis shifts to supporting the bulk data load rate. If loading one day’s worth of data into a big data warehouse takes over 24 hours, it’s an issue!
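The load-rate concern above comes down to simple arithmetic, which the following minimal sketch illustrates. The function names and the sample figures (6,000 GB per day, 200 GB per hour) are hypothetical and chosen only for illustration; real load rates vary with hardware, indexing, and concurrent query activity.

```python
def load_hours(daily_volume_gb: float, load_rate_gb_per_hour: float) -> float:
    """Hours needed to load one day's worth of data at a sustained rate."""
    if load_rate_gb_per_hour <= 0:
        raise ValueError("load rate must be positive")
    return daily_volume_gb / load_rate_gb_per_hour


def load_keeps_up(daily_volume_gb: float,
                  load_rate_gb_per_hour: float,
                  window_hours: float = 24.0) -> bool:
    """True if one day's data loads in less than the available window."""
    return load_hours(daily_volume_gb, load_rate_gb_per_hour) < window_hours


if __name__ == "__main__":
    # Hypothetical figures: 6,000 GB arriving per day, loaded at 200 GB/hour.
    hours = load_hours(6000, 200)
    print(f"Daily load takes {hours:.1f} hours")           # 30.0 hours
    print("Keeps up:", load_keeps_up(6000, 200))           # False
```

With these assumed numbers the load falls behind by six hours every day, so the backlog grows until either the load rate improves or the daily volume shrinks.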
As the big data environment matures, its survival depends upon its usefulness, which is measured by how the data is analyzed. Loaded data now covers multiple time periods, so query complexity and resource consumption can increase dramatically. And as old data is used less frequently, it must be purged or archived.
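One possible way to archive aging data is sketched below using Python's built-in sqlite3 module as a stand-in for a real warehouse. The table names (`events`, `events_archive`), the `load_date` column, and the cutoff value are all assumptions made for this example; a production system would more likely detach or drop whole partitions than move rows one statement at a time.

```python
import sqlite3


def archive_older_than(conn: sqlite3.Connection, cutoff: str) -> int:
    """Move rows loaded before `cutoff` (ISO date string) into the archive table.

    Returns the number of rows removed from the live table.
    """
    with conn:  # commit on success, roll back on error
        conn.execute(
            "INSERT INTO events_archive SELECT * FROM events WHERE load_date < ?",
            (cutoff,),
        )
        cur = conn.execute("DELETE FROM events WHERE load_date < ?", (cutoff,))
    return cur.rowcount


def make_demo_db() -> sqlite3.Connection:
    """Build a small in-memory database with hypothetical sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE events (id INTEGER PRIMARY KEY, load_date TEXT, payload TEXT);
        CREATE TABLE events_archive (id INTEGER PRIMARY KEY, load_date TEXT, payload TEXT);
        INSERT INTO events VALUES (1, '2012-01-15', 'old');
        INSERT INTO events VALUES (2, '2012-06-30', 'old');
        INSERT INTO events VALUES (3, '2013-02-01', 'recent');
    """)
    return conn


if __name__ == "__main__":
    db = make_demo_db()
    moved = archive_older_than(db, "2013-01-01")
    print("Rows archived:", moved)
```

ISO-formatted date strings compare correctly as text, which is why a plain `<` comparison is enough in this sketch.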
To manage this process, the DBA usually combines good database design and process design as follows:
The value of analytics (also called business intelligence, or BI) drives the need to create and maintain a big data environment. Imagine a huge datastore with simultaneous bulk loading and querying of data. Here, the DBA needs to be aware of data availability requirements as well as performance tuning.
Long-running extract-transform-load (ETL) jobs can lock important data during execution. To make data more available, the DBA can use an active/inactive table technique. Each critical structure is defined as two tables, or two partitions of the same table. One is designated active, the other inactive. Querying is directed to the active table while data loading is executed against the inactive one. After the load is complete, the table definitions are switched.
One last idea worthy of mention is special-purpose software and hardware. Several hybrid data stores and query engines are available, including the IBM Smarter Analytics system.
The DBA addresses resource constraints in the big data environment through a balance of techniques: