Big Data Governance: A Framework to Assess Maturity
Markets today are abuzz with news, anecdotes, and rumors of the purported omnipresence and omniscience of big data. While marketers are busy formulating ways to monetize the vastness of available zettabytes, data scientists the world over are burning the midnight oil to harness new technologies (like streaming, Hadoop, and other NoSQL stores), commodity hardware, and cloud computing to literally transform the world.
Organizations see these technologies as game changers, especially since several of them support data in its native format without the need for transformation or modeling before questions can be asked. At this point in the big data lifecycle, organizations do not always know which data sources have value and they do not necessarily want to invest vast resources to gather requirements and sponsor formal information governance programs.
As the exploratory phase of big data “skunk works” projects begins to drive business value and leads to formal initiatives, organizations will turn their attention to the fundamental questions of information management:
- Do we fully recognize the responsibilities associated with handling big data?
- How does big data change the traditional concept of information as a corporate asset?
- What are the emerging requirements around privacy?
- How do all these big data technologies relate to our current IT infrastructure?
All of this hoopla about big data raises more questions than the CIO may be prepared for. In our experience, many organizations justify the lack of adequate governance policies by claiming that big data is somehow “different,” which we believe sidesteps the issue. Simply stated, as big data technologies become operational rather than exploratory, they need the same governance disciplines as traditional approaches to data management.
One of the first steps in the implementation of an information governance program is to assess the current state and the desired future state of maturity. According to Banu Ekiz, Vice President of Business Intelligence, Akbank Information Technologies Turkey, “Big data has all the characteristics of ‘small data’ when it comes to governance. The only difference is in the complexity and variety of channels that it comes from. Although there are greater demands on organizational energy and resources to govern big data, the gain in terms of business value is that much higher. Being able to analyze big data from the web, and take necessary actions, can have a major impact on a company’s profit. A maturity model for big data governance is a critical first step in this journey.”
We have leveraged the eleven categories of the IBM Information Governance Council Maturity Model (see figure). The following is a sample set of questions to assess the maturity of big data governance:
1. Business outcomes
- Have you identified the key business stakeholders for the big data governance program, e.g.:
  - Marketing for social media governance
  - Supply chain for RFID governance
  - Legal for data retention policies
  - Human resources for governance of employee-related social media
  - Operations and maintenance for sensor data governance
  - Billing for the governance of call detail records in telecommunications
  - Medical informatics and claims administration for the governance of claims records in health insurance
- Have you quantified the financial benefits from big data governance, e.g.:
  - Reduced risk of fines and lawsuits due to data breaches
  - Lower exposure to credit events
  - Avoidance of negative brand impact due to bad publicity about the misuse of data
  - Lower likelihood of paying twice for the same dataset (e.g., seismic data) because of inconsistent nomenclature
  - Increased cross-sell and up-sell opportunities due to integration of social media with the master data environment
  - Less equipment downtime because the predictive maintenance program couples sensor data with consistent, high-quality asset data
2. Organizational structures and awareness
- Do you have a defined scope for big data that applies to your organization?
  - Big transaction data (e.g., healthcare claims, telecommunications call detail records, electronic medical records, call center agents’ notes)
  - Web and social media data (e.g., Facebook, Twitter, LinkedIn)
  - Machine-to-machine data (e.g., smart meter readings, oil rig sensors, telematics, RFID)
- Have you prioritized the types of big data that need to be governed?
- Have you extended the information governance charter to cover big data?
- Are there clear hand-offs between the teams responsible for big data repositories and traditional systems?
- Is big data governance included within the job description of key roles such as the chief data officer and the information governance officer?
- For emerging skills such as data scientists, are their roles clearly defined?
- Have any big data issues been addressed by the information governance council?
- Has the information governance council addressed the convergence of big data and master data (e.g., integrating social media into the customer master)?
3. Stewardship
- How will you address the stewardship of big data?
  - Extend the job description of existing stewards (e.g., customer data steward covers social media)
  - Appoint additional big data stewards (e.g., social media steward to deal with the unique privacy issues that are specific to that domain)
- Are the stewards’ jobs and data manipulation tasks documented and repeatable?
- Will the data stewards be responsible for gathering input from legal, marketing, and other departments regarding the acceptable use of big data (e.g., integration of social media with master data management)?
- Have you built a Responsibility Assignment (RACI) matrix to define the roles and responsibilities for critical data elements?
- Are the data stewardship roles formalized with human resources?
- Are stewards empowered to define policies for retention of big data based on regulatory requirements and business imperatives? Are these policies consistent with those for traditional systems?
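As an illustration of the RACI question above, the following is a minimal sketch of how a responsibility matrix for critical data elements could be checked for gaps. The data elements, role names, and the rule of exactly one Accountable and at least one Responsible role per element are assumptions made for this example, not part of the maturity model itself.

```python
from collections import Counter

# Hypothetical RACI matrix: critical data element -> {role: "R" | "A" | "C" | "I"}
raci = {
    "customer_social_handle": {
        "social media steward": "R",
        "chief data officer": "A",
        "legal": "C",
        "marketing": "I",
    },
    "call_detail_record": {
        "billing steward": "R",
        "legal": "C",
    },
}

def raci_gaps(matrix):
    """Return, per data element, the assignment problems found (empty dict if none)."""
    gaps = {}
    for element, assignments in matrix.items():
        counts = Counter(assignments.values())
        problems = []
        # Assumed rule: exactly one Accountable role per element.
        if counts["A"] != 1:
            problems.append(f"expected exactly 1 Accountable, found {counts['A']}")
        # Assumed rule: at least one Responsible role per element.
        if counts["R"] < 1:
            problems.append("no Responsible role assigned")
        if problems:
            gaps[element] = problems
    return gaps

for element, problems in raci_gaps(raci).items():
    print(element, "->", "; ".join(problems))
```

A check like this only flags structural gaps; whether the named roles are the right ones remains a decision for the information governance council.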
4. Data risk management
- Is risk management a key stakeholder for big data governance?
- Have you established the linkage between big data governance and risk management?
- Is your business continuity planning operationally realistic, given that technologies like Hadoop were not designed with traditional enterprise disaster recovery considerations in mind?
5. Policy
- Have you documented a set of policies for big data governance?
- Can these policies be inspected for enforcement?
- Have you translated these policies into a set of operational controls?
- Are you monitoring adherence to these operational controls using a Governance, Risk, and Compliance (GRC) framework? For example, consider an organization that leverages social media data in its CRM environment and has established a policy requiring the periodic deletion of this data to maintain customer privacy. The big data governance program needs to establish an operational control to ensure that the deletion actually happens on schedule, and it can use a GRC tool to document adherence to the policy.
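The periodic-deletion control described above could be monitored with something like the following sketch. It assumes the deletion job records the date of each run and that the policy is expressed as a maximum number of days between runs; the function name, the 30-day period, and the sample dates are all hypothetical.

```python
from datetime import date, timedelta

def overdue_intervals(deletion_dates, period_days, start, today):
    """Return (earlier, later) pairs where the time between consecutive
    deletion runs (or since the last run) exceeded the policy period."""
    checkpoints = [start] + sorted(deletion_dates) + [today]
    limit = timedelta(days=period_days)
    return [
        (earlier, later)
        for earlier, later in zip(checkpoints, checkpoints[1:])
        if later - earlier > limit
    ]

# Hypothetical run log for the social media deletion job.
runs = [date(2024, 1, 30), date(2024, 2, 28), date(2024, 5, 1)]
violations = overdue_intervals(
    runs, period_days=30, start=date(2024, 1, 1), today=date(2024, 5, 15)
)
for earlier, later in violations:
    print(f"No deletion between {earlier} and {later}")
```

The resulting list of violations is the kind of evidence that a GRC tool could record against the policy.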
6. Data quality management
- Do you have consensus on the quality issues associated with big data, including data whose value may not be high or obvious?
- Are your organizational policies on data quality being applied to both real-time (streaming) and at-rest (Hadoop) technologies?
- Do you address data quality directly in Hadoop?
- Do you use unstructured data to increase the quality of sparse data? For example, patients do not always mention that they are smokers during admission to a hospital. However, the predictive analytics team can use doctor’s notes, discharge summaries, and patient physicals to discern if an individual is a smoker to calculate the likelihood that they will be readmitted within 30 days after treatment for congestive heart failure.
- Have you considered the data quality issues relating to machine-to-machine communications (e.g., RFID readings can be error-prone depending on the read angle and in environments with high moisture content)?
- Which dimensions of data quality matter more for big data than for traditional corporate data? For example, the timeliness (time-stamping and accuracy) of machine log data is more critical for sensor data from high-end machines and medical equipment.
- Which dimensions of data quality might be less applicable to big data (e.g., the accuracy of Twitter and Facebook data)?
- How are you inspecting data quality issues in a repeatable and documented way?
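To make the smoker example above concrete, here is a deliberately simple sketch of filling a sparse structured field from unstructured notes. The keyword patterns and the function name are assumptions for illustration; a real predictive analytics team would likely rely on a proper clinical text-mining tool rather than a handful of regular expressions.

```python
import re

# Hypothetical, deliberately small keyword lists.
NEGATED = re.compile(
    r"\b(non-?smoker|never smoked|denies (smoking|tobacco use)|no tobacco use)\b",
    re.IGNORECASE,
)
SMOKER = re.compile(r"\b(smoker|smokes|smoking|tobacco use)\b", re.IGNORECASE)

def infer_smoker(structured_flag, notes):
    """Return the admission flag if present; otherwise infer it from notes.

    Returns True, False, or None when the notes are silent or contradictory.
    """
    if structured_flag is not None:
        return structured_flag
    text = " ".join(notes)
    negated = bool(NEGATED.search(text))
    # Remove negated phrases first so "non-smoker" is not counted as "smoker".
    positive = bool(SMOKER.search(NEGATED.sub("", text)))
    if positive and not negated:
        return True
    if negated and not positive:
        return False
    return None

print(infer_smoker(None, ["Discharge summary: smokes one pack per day."]))
```

Returning None for silent or conflicting notes keeps the inferred value from masquerading as a verified fact, which matters when it feeds a readmission model.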
7. Information lifecycle management
- What is the volume of storage for big data? What is the anticipated annual rate of growth?
- What is the cost of storage for big data? What is the anticipated annual rate of growth?
- Do you understand the regulatory requirements that govern the retention of big data, e.g.:
  - Regulators may need to look at rig sensor data in case of an oil spill
  - Accident investigators may need to review locomotive sensor data
- Do you understand the business requirements that drive the retention of big data (e.g., marketing may need a certain number of months of telephone call detail records to build churn models)?
- Have you extended the retention schedule to include big data?
- Does your retention schedule include the legal citations that drive retention of big data by country, state, and province?
- Have you created pointers from the retention schedule to the physical repositories for big data?
- Do you have a process to document legal holds on big data based on ongoing litigation?
- Do you have a process to defensibly dispose of big data that is no longer required based on regulations and business needs?
- Do you compress big data, whether or not it resides in Hadoop?
- Do you archive big data to reduce IT costs and improve application performance?
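The interplay between the retention schedule, legal holds, and defensible disposal in the questions above can be sketched as follows. The `Dataset` fields, the retention periods, and the sample records are hypothetical; in practice the retention period would come from the schedule and its legal citations.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class Dataset:
    name: str
    created: date
    retention_days: int        # taken from the retention schedule (assumed values)
    legal_hold: bool = False   # set while litigation is ongoing

def eligible_for_disposal(ds, today):
    """A dataset may be disposed of only after its retention period has
    elapsed and only if no legal hold applies."""
    if ds.legal_hold:
        return False
    return today >= ds.created + timedelta(days=ds.retention_days)

catalog = [
    Dataset("call_detail_records_2022", date(2022, 1, 1), retention_days=365),
    Dataset("rig_sensor_2021", date(2021, 6, 1), retention_days=365, legal_hold=True),
    Dataset("clickstream_recent", date(2024, 4, 1), retention_days=365),
]
today = date(2024, 6, 1)
disposable = [ds.name for ds in catalog if eligible_for_disposal(ds, today)]
print(disposable)
```

Keeping the hold check ahead of the date check reflects the priority that ongoing litigation takes over an expired retention period.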
8. Information security and privacy
- Is the chief information security officer a key sponsor of the big data governance program?
- Do you understand privacy regulations that affect big data (especially social media) by country, state, and province?
- Have you established guidelines regarding the acceptable use of social media data for customers?
- Have you defined policies regarding the acceptable use of geolocation data for customers?
- Have you worked with human resources to establish policies regarding the usage of social media and geolocation data for employees and job candidates?
- Do you encrypt any sensitive big data in your production systems?
- Do you use unmasked sensitive big data within development, business intelligence, and test environments?
- Do you log and track user permissions with an audit trail when leveraging your customers’ activity data on your website to build comprehensive profiles and product recommendations?
- Are you prepared to handle public relations and legal fallout from the advanced predictive capabilities of your recommendation engines, especially regarding gender and age sensitivities (e.g., a retailer promoting maternity products to a teenager when her parents were unaware of her impending pregnancy)?
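For the question above about unmasked sensitive data in development and test environments, the following is a minimal sketch of one masking approach: replacing sensitive values with a keyed hash so that the same input always yields the same token. The field names, the key, and the token length are assumptions, and this is pseudonymization rather than full anonymization.

```python
import hashlib
import hmac

def mask_value(value: str, key: bytes, length: int = 12) -> str:
    """Replace a sensitive value with a truncated keyed hash (HMAC-SHA256)."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]

def mask_record(record, sensitive_fields, key):
    """Return a copy of the record with the sensitive fields masked."""
    return {
        field: mask_value(str(value), key) if field in sensitive_fields else value
        for field, value in record.items()
    }

# Hypothetical call detail record and masking key.
key = b"example-key-kept-outside-test-environments"
record = {"caller": "+1-555-0100", "cell_tower": "TWR-88", "duration_s": 42}
masked = mask_record(record, {"caller", "cell_tower"}, key)
print(masked)
```

Because the masking is deterministic for a given key, joins across masked tables still work in test environments while the original values stay out of them.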
According to Nina Vredevoogd, Manager, IT Planning & Program Management – Concur Technologies, “Big data is global. The concept of privacy, laws and regulations around data are not. For global companies, developing a comprehensive information management program and policies governing big data is imperative. Consumers are becoming more concerned about online privacy. Companies that adopt and actively market responsible policies to control access to consumer data, will likely gain competitive advantage in the rapidly growing world of online commerce.”
9. Data architecture
- What is the co-existence strategy for Hadoop, NoSQL, and other emerging big data technologies relative to your current architecture?
- Have you determined which applications should move into the big data infrastructure platform?
- Have you determined which applications should stay out of the big data infrastructure platform?
- How can your existing ETL tools move data in and out of the big data infrastructure platform?
- How do you leverage data compression and archiving technologies within the big data infrastructure platform?
- Are you considering the impact of master data on big data, e.g.:
  - Customer master data – Leverage 10-K and 10-Q financial reports to update customer risk management hierarchies when ownership positions change
  - Asset master data – If sensor data indicates that a pump is failing in one plant, then use consistent asset nomenclature to replace similar pumps at other plants
  - Product master data – Consumer packaged goods companies leverage detailed transaction data at the retail point of sale to drive analytics around which products to stock at which store location, but these analytics will produce inconsistent results if different retailers use different nomenclature for the same product
- Are you considering the impact of reference data on big data (e.g., ICD-9 and ICD-10 codes for healthcare claims data)?
- Can you handle data quality in situ within the big data infrastructure platform without the need to create intermediate data structures?
- How do you handle the lineage of big data?
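The product master data question above can be illustrated with a small sketch in which retailer-specific product labels are resolved to a single master identifier before point-of-sale volumes are aggregated. The cross-reference table, identifiers, and sales rows are invented for the example.

```python
from collections import defaultdict

# Hypothetical cross-reference: (retailer, retailer's label) -> master product ID.
PRODUCT_XREF = {
    ("retailer_a", "COLA 12OZ CAN"): "P-1001",
    ("retailer_b", "Cola Can 355ml"): "P-1001",
    ("retailer_a", "CHIPS SALT 5OZ"): "P-2002",
}

def units_by_master_product(sales):
    """Aggregate units by master product ID and collect labels with no mapping."""
    totals = defaultdict(int)
    unmatched = []
    for retailer, label, units in sales:
        master_id = PRODUCT_XREF.get((retailer, label))
        if master_id is None:
            unmatched.append((retailer, label))
        else:
            totals[master_id] += units
    return dict(totals), unmatched

sales = [
    ("retailer_a", "COLA 12OZ CAN", 120),
    ("retailer_b", "Cola Can 355ml", 80),
    ("retailer_b", "Salted Chips 140g", 30),
]
totals, unmatched = units_by_master_product(sales)
print(totals, unmatched)
```

Surfacing the unmatched labels instead of silently dropping them gives data stewards a work queue for extending the cross-reference.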
According to Jay Yusko, Ph.D., Vice President of Technology Research at SymphonyIRI Group, “Information governance for big data is an absolute necessity. By its very nature, big data is developed from many disparate sources that need to be integrated to be useful as information that can be analyzed. To make this integration possible, the data from all the different sources need to be standardized with the same set of rules and then validated and monitored. This is really the heart of the information governance program for big data.”
10. Classification and metadata
- Does your organization-wide business terminology (business glossary) include key business terms relating to big data (e.g., “unique visitor” for clickstream data)?
- Has the business appointed data stewards to manage key business terms for big data?
- How frequently are the business and technical metadata refreshed and kept in sync across business units and IT?
- How do you handle the lineage of big data within the big data infrastructure platform?
- How do you handle impact analysis of big data within the big data infrastructure platform?
- Are you capturing key operational metadata to identify situations when big data is not loaded?
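As a sketch of the operational metadata question above, the following assumes that each successful load writes its business date to a load log and that a load is expected every day. Both assumptions, along with the function name and dates, are hypothetical.

```python
from datetime import date, timedelta

def missing_load_dates(loaded_dates, start, end):
    """Return every date in [start, end] for which no load was recorded."""
    loaded = set(loaded_dates)
    total_days = (end - start).days + 1
    expected = (start + timedelta(days=offset) for offset in range(total_days))
    return [day for day in expected if day not in loaded]

# Hypothetical load log for a daily clickstream feed.
load_log = [date(2024, 3, 1), date(2024, 3, 2), date(2024, 3, 4)]
gaps = missing_load_dates(load_log, start=date(2024, 3, 1), end=date(2024, 3, 5))
print(gaps)
```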
11. Audit information logging and reporting
- Do you have database administrators, contractors, and other third parties who have access to unencrypted sensitive big data such as geolocation data, telephone call detail records, utility smart meter readings, and health claims?
In summary, organizations need to treat big data as an enterprise asset similar to other data types. As a general rule of thumb, if it is a governance consideration with a database or warehouse, then it is a governance consideration with big data technologies as well.
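One way to turn the answers to the questionnaire above into a comparison of current and desired future state is sketched below. The 1 to 5 scale, the scores, and the choice of categories are assumptions for illustration only.

```python
# Hypothetical self-assessment on an assumed 1-5 scale: category -> (current, target).
assessment = {
    "Business outcomes": (2, 4),
    "Data risk management": (1, 4),
    "Information security and privacy": (3, 4),
    "Classification and metadata": (2, 3),
}

def maturity_gaps(scores):
    """Return (category, gap) pairs sorted from the largest gap to the smallest."""
    gaps = [
        (category, target - current)
        for category, (current, target) in scores.items()
    ]
    return sorted(gaps, key=lambda pair: pair[1], reverse=True)

for category, gap in maturity_gaps(assessment):
    print(f"{category}: gap of {gap} level(s)")
```

Sorting by gap size is only one possible way to prioritize; an organization might equally weight categories by regulatory exposure or business value.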