Big Data and Managing Resource Constraints
To most IT professionals, the term big data conjures up visions of hundreds of terabytes of data flowing through a network with the speed of a whirlwind while users attempt to execute complex queries with minimal elapsed time. This vision of big data as “volume, velocity, and variety” is an important one: it allows IT infrastructure management to break down the idea into parts that are more easily managed.
Database administration is an important component of the IT infrastructure. The DBA is responsible for data backup, recoverability, and performance tuning. A big data environment requires the DBA to look at things a bit differently than with more traditional systems.
Big data storage
The first aspect of big data is… lots of data! Large databases by themselves are a challenge to manage. However, DBAs must also consider data movement and data retrieval requirements.
For the DBA, the big data environment often leaves no time (and usually not enough disk or magnetic tape) for database reorganizations (reorgs) or database backups. The databases themselves must therefore be defined with these considerations in mind:
- Mass insert or bulk table load from pre-sorted data
- Minimal free space for tables and indexes (since the data is load-only with no update)
- Standard data extract and data transform processes and scripts
- Good object naming standards and a metadata catalog
Big data movement
The resources used for data movement are related to an environment’s maturity. When the big data environment is first defined, the emphasis is on data retrieval. This is because most big data applications must show a payback early on in order to be judged useful (or profitable). As the environment grows, the emphasis shifts to supporting the bulk data load rate. If loading one day’s worth of data into the big data warehouse takes over 24 hours, the load process can never catch up, and that’s an issue!
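The load-rate concern above comes down to simple arithmetic, sketched here in Python. The function names and the sample figures are hypothetical and chosen only for illustration; they do not come from any particular system.

```python
def load_backlog_hours(hours_per_daily_load: float, days: int) -> float:
    """Hours the load process has fallen behind after `days` days,
    if each day's data takes `hours_per_daily_load` hours to load."""
    # A calendar day offers at most 24 hours of load time;
    # any overrun is carried into the next day and accumulates.
    overrun_per_day = max(0.0, hours_per_daily_load - 24.0)
    return overrun_per_day * days


def required_rate_gb_per_hour(daily_volume_gb: float, load_window_hours: float) -> float:
    """Minimum sustained load rate needed to fit one day's data
    into the available load window."""
    if load_window_hours <= 0:
        raise ValueError("load window must be positive")
    return daily_volume_gb / load_window_hours


if __name__ == "__main__":
    # Hypothetical figures: a 26-hour daily load, observed over 30 days.
    print(load_backlog_hours(26.0, 30))
    # Hypothetical figures: 2,400 GB per day, 8-hour load window.
    print(required_rate_gb_per_hour(2400.0, 8.0))
```

A load that overruns by even two hours a day falls further behind every day, which is why the bulk load rate becomes the main concern as the environment grows.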
As the big data environment matures, its survival depends upon its usefulness, which is measured by how the data is analyzed. Loaded data now covers multiple time periods, so query complexity and resource consumption can increase dramatically. And as old data is used less frequently, it must be purged or archived.
To manage this process, the DBA usually combines good database design and process design as follows:
- Data is separated into logical or physical partitions by time period to support time-based analytical queries
- Time-based partitioning supports stale data purge by allowing the removal of a partition, rather than by costly and time-consuming SQL delete logic
- In its early stages, the big data environment can benefit from additional indexes to support query performance; later, some indexes may be removed to speed up bulk loads
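The partition-based purge described above can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module. SQLite has no native table partitioning, so the sketch assumes one table per month as a stand-in for a partition; the `sales_YYYY_MM` naming scheme, the column layout, and the function names are all hypothetical.

```python
import sqlite3


def partition_name(period: str) -> str:
    """Map a period such as '2024-01' to a table name such as 'sales_2024_01'."""
    year, month = period.split("-")
    if not (year.isdigit() and month.isdigit()):
        raise ValueError(f"unexpected period format: {period!r}")
    return f"sales_{year}_{month}"


def load_partition(conn: sqlite3.Connection, period: str, rows) -> None:
    """Bulk-load rows into the table that holds the given time period."""
    table = partition_name(period)
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} (sale_date TEXT, amount REAL)")
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
    conn.commit()


def purge_partition(conn: sqlite3.Connection, period: str) -> None:
    """Purge a stale period by dropping its table instead of deleting row by row."""
    conn.execute(f"DROP TABLE IF EXISTS {partition_name(period)}")
    conn.commit()


def list_partitions(conn: sqlite3.Connection) -> list:
    """Return the names of the period tables that currently exist."""
    cur = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name GLOB 'sales_*' ORDER BY name"
    )
    return [row[0] for row in cur]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load_partition(conn, "2024-01", [("2024-01-15", 10.0)])
    load_partition(conn, "2024-02", [("2024-02-03", 20.0)])
    purge_partition(conn, "2024-01")  # stale month removed in one step
    print(list_partitions(conn))
```

In a database product with native partitioning, the equivalent of `purge_partition` would be that product's own operation for dropping or detaching a partition; the point of the sketch is only that the purge touches one object rather than every stale row.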
The value of analytics (also called business intelligence, or BI) drives the need to create and maintain a big data environment. Imagine a huge datastore with simultaneous bulk loading and querying of data. Here, the DBA needs to be aware of data availability requirements as well as performance tuning.
Long-running extract-transform-load (ETL) jobs can lock important data during execution. To make data more available, the DBA can use an active/inactive table technique. Each critical structure is defined as two tables, or two partitions of the same table. One is designated active, the other inactive. Queries are directed to the active table while data loading runs against the inactive one. After the load is complete, the roles of the two tables are switched, so that the newly loaded table becomes active.
One last idea worthy of mention is special-purpose software and hardware. Several hybrid data stores and query engines are available, including the IBM Smarter Analytics system.
A summary of big data resource constraints
The DBA addresses resource constraints in the big data environment through a balance of techniques:
- Storage constraints are addressed by minimizing free space, eliminating database reorgs, and reducing or eliminating backups
- Extended run times for bulk loads are reduced by time-based data partitioning and by removing indexes that are no longer needed
- Query delays caused by data being unavailable during loads are reduced by using an active/inactive table technique