Big data governance is part of a broader information governance program that formulates policy relating to the optimization, privacy, and monetization of big data by aligning the objectives of multiple functions. However, big data governance is meaningless without an understanding of the underlying data types.
Figure 1. A three-dimensional framework for big data governance
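This article provides a framework for big data governance. As shown in Figure 1, the framework consists of three dimensions: big data types, information governance disciplines, and industries and functions.
Big data falls into five categories: web and social media, machine-to-machine (M2M) data, big transaction data, biometrics, and human-generated data.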
A big data framework looks different depending on industry and function.
Solution: Patient monitoring
Big data type: M2M data
Disciplines: Data quality, information lifecycle management, privacy
A hospital used streaming analytics technologies to monitor the health of newborn babies in the neonatal intensive care unit. Using these technologies, the hospital was able to predict the onset of disease a full 24 hours before any symptoms appeared. These technologies depended on large volumes of time series data, but readings were sometimes missing because a patient’s movement caused a lead to disengage and stop reporting. In these situations, the streaming platform applied linear and polynomial regressions to historical readings to fill in the gaps in the time series data. The hospital also tagged all time series data that had been modified by software algorithms, because it felt that it would have to produce both the original and modified readings in case of a lawsuit or medical inquiry. Finally, the hospital established policies around safeguarding protected health information.
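The gap-filling and tagging steps described above can be sketched roughly as follows. This is a minimal illustration, not the hospital's actual implementation: the `Reading` type, the `fill_gaps` function, and the `window` parameter are hypothetical names, and only a linear least-squares fit is shown (the polynomial case is omitted). The sketch assumes timestamps are distinct.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Reading:
    t: float                 # time of the reading (hypothetical unit: seconds)
    value: Optional[float]   # None when the lead was disengaged
    imputed: bool = False    # True when software filled in the value


def fit_line(points: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Ordinary least-squares fit of value = slope * t + intercept."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in points)
    cov = sum((t - mean_t) * (v - mean_v) for t, v in points)
    slope = cov / var_t
    return slope, mean_v - slope * mean_t


def fill_gaps(series: List[Reading], window: int = 4) -> List[Reading]:
    """Return a new series in which gaps are estimated from prior observed readings.

    The input series is left untouched, so both the original and the
    modified readings remain available.
    """
    filled: List[Reading] = []
    for r in series:
        if r.value is not None:
            filled.append(Reading(r.t, r.value, False))
            continue
        # Use only genuinely observed readings as history, never earlier estimates.
        history = [(x.t, x.value) for x in filled
                   if x.value is not None and not x.imputed][-window:]
        if len(history) < 2:
            filled.append(Reading(r.t, None, False))  # too little history to estimate
            continue
        slope, intercept = fit_line(history)
        filled.append(Reading(r.t, slope * r.t + intercept, True))
    return filled
```

Keeping the `imputed` flag on every reading is one way to support the tagging policy: any downstream consumer can tell measured values from software-generated ones.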
Solution: Predictive modeling based on electronic medical records
Big data type: Human-generated data
Discipline: Data quality
The analytics department at a hospital built a predictive model based on 150 variables and 20,000 patient encounters to determine the likelihood that a patient would be readmitted within 30 days of treatment for congestive heart failure. For example, the analytics team identified the patient’s smoking status as a critical variable. At first, only 25 percent of the structured data around smoking status was populated with binary yes/no answers. However, the analytics team increased the population rate for smoking status to 85 percent of the encounters by applying content analytics to electronic medical records containing doctors’ notes, discharge summaries, and patient physicals. In this way, the team used unstructured data sources to improve the quality of sparsely populated structured data.
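As a rough illustration of how free text can populate a sparse structured field, the sketch below derives a yes/no smoking status from clinical notes with simple pattern matching. The phrase lists, the field names (`smoker`, `notes`, `smoker_source`), and the functions are all assumptions made for the example; real content analytics would be considerably more sophisticated, and this sketch does not handle cases such as former smokers.

```python
import re
from typing import Optional

# Hypothetical phrase lists; negations are checked first because
# "non-smoker" also contains the word "smoker".
NEGATIVE = re.compile(
    r"\b(non-?smoker|never smoked|denies (smoking|tobacco)|no tobacco use)\b", re.I)
POSITIVE = re.compile(
    r"\b(current smoker|smokes|smoker|tobacco use|pack[- ]years?)\b", re.I)


def smoking_status_from_note(note: str) -> Optional[bool]:
    """Return True/False if the note states a smoking status, else None."""
    if NEGATIVE.search(note):
        return False
    if POSITIVE.search(note):
        return True
    return None


def enrich(encounter: dict) -> dict:
    """Fill a missing structured `smoker` field from notes and record its provenance."""
    if encounter.get("smoker") is not None:
        return {**encounter, "smoker_source": "structured"}
    for note in encounter.get("notes", []):
        status = smoking_status_from_note(note)
        if status is not None:
            return {**encounter, "smoker": status, "smoker_source": "derived_from_text"}
    return {**encounter, "smoker_source": "unknown"}
```

Recording where each value came from lets data stewards measure how much of the field was populated from text and audit the derived values separately.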
Solution: Claims analytics
Big data type: Big transaction data
Discipline: Data quality
A large health plan processes over 500 million claims per year, with each claims record consisting of 600 to 1,000 attributes. The plan uses predictive analytics to determine whether certain proactive measures are required for a small subset of members. However, the business intelligence team found that physicians were using inconsistent procedure codes to submit claims, which limited the effectiveness of the predictive analytics. The business intelligence team also analyzed the text within claims documents. For example, the team used terms such as “chronic congestion” and “blood-sugar monitoring” to determine that those members might be candidates for disease management programs for asthma and diabetes, respectively.
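The term matching described above might look something like the following sketch. The two terms and their programs come from the example in the text; the mapping structure and the `candidate_programs` function are hypothetical, and a plain substring match is used only to keep the illustration short.

```python
# Terms from the example above mapped to disease management programs.
TERM_TO_PROGRAM = {
    "chronic congestion": "asthma",
    "blood-sugar monitoring": "diabetes",
}


def candidate_programs(claim_text: str) -> set:
    """Return the programs whose trigger terms appear in the claim text."""
    text = claim_text.lower()
    return {program for term, program in TERM_TO_PROGRAM.items() if term in text}
```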
Solution: Smart meters
Big data type: M2M data
Disciplines: Privacy, information lifecycle management
Several utilities are rolling out smart meters to measure the consumption of water, gas, and electricity at regular intervals of one hour or less. These smart meters generate copious amounts of interval data that need to be governed appropriately. Utilities must safeguard the privacy of this interval data because it can potentially reveal a subscriber’s household activities as well as when a homeowner might be away. In addition, utilities need to establish policies for the archival and deletion of interval data to reduce storage costs.
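An archival and deletion policy for interval data could be expressed as a simple age-based rule, as in the sketch below. The retention periods and the `lifecycle_action` function are hypothetical values chosen for illustration; actual periods would be set by the utility and its regulators.

```python
from datetime import date, timedelta

# Hypothetical retention periods, for illustration only.
ARCHIVE_AFTER = timedelta(days=90)
DELETE_AFTER = timedelta(days=730)


def lifecycle_action(reading_date: date, today: date) -> str:
    """Decide whether an interval reading is kept online, archived, or deleted."""
    age = today - reading_date
    if age >= DELETE_AFTER:
        return "delete"
    if age >= ARCHIVE_AFTER:
        return "archive"
    return "keep"
```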
Solution: Facebook loyalty app
Big data type: Web and social media
Disciplines: Privacy, master data integration, organization
A retailer’s marketing department might want to use master data on customers, products, employees, and store locations to enrich its Facebook app. The success of the Facebook app depends on a strong foundation of master data management (MDM) and policies around social media governance. For example, the retailer would need to adhere to the Facebook Platform Policies by not using data on a customer’s friends outside of the context of the app. In addition, marketing and social media stewards need to agree on a consistent set of identifiers to link a customer’s Facebook profile with his or her MDM record. Finally, the retailer needs to establish a robust product hierarchy to enable product comparisons. For instance, the retailer would need to know that a customer who purchased a “Whirlpool GX5FHDXVY” already has a product in the “refrigerator” hierarchy.
Solution: Personalized messaging based on facial recognition and social media
Big data type: Web and social media, biometrics
Disciplines: Privacy, business process integration
A March 2012 report from the U.S. Federal Trade Commission details how retailers could potentially use facial recognition technology in combination with a photo from social media to make personalized offers to customers based on their buying behavior and location. While this information could have a tremendous impact on retailers’ loyalty programs, it would also have serious privacy ramifications. Retailers would need to make the appropriate privacy disclosures before implementing these applications.
Solution: Customer churn analytics
Big data type: Web and social media, big transaction data
Disciplines: Privacy, master data integration
Telecommunications operators build detailed customer churn models that include social media and big transaction data such as call detail records (CDRs). However, the overall value of the churn models also depends on the quality of traditional attributes of customer master data such as date of birth, gender, location, and income. A large operator wanted to implement a predictive analytics strategy around churn management, because analyzing subscribers’ calling patterns has proven to be an effective way to predict churn. The operator decided to outsource its churn analytics to an overseas vendor. Because the CDRs had to be shipped to the vendor each day, there was significant concern over safeguarding the privacy of customer data. After the appropriate deliberation, the operator decided to mask sensitive data such as subscriber name, because the calling and receiving telephone numbers were the primary fields of value for churn analytics.
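A masking step of the kind described above could be sketched as follows. The field names and the `mask_cdr` function are assumptions for the example, not the operator's actual schema. As in the text, the calling and receiving numbers pass through unmasked because the analytics depends on them.

```python
# Hypothetical set of sensitive CDR fields to mask before shipping to the vendor.
SENSITIVE_FIELDS = {"subscriber_name"}
MASK = "***MASKED***"


def mask_cdr(cdr: dict) -> dict:
    """Return a copy of the CDR with sensitive fields replaced by a mask value."""
    return {key: (MASK if key in SENSITIVE_FIELDS else value)
            for key, value in cdr.items()}
```

Because the function returns a copy, the unmasked record stays inside the operator's environment and only the masked copy needs to leave it.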
Solution: Claims investigation, underwriting
Big data type: Web and social media
Disciplines: Privacy, business process integration
Many insurance carriers now use social media to investigate claims. However, most regulators still do not permit insurers to use social media to set policy rates during the underwriting process. For example, if a life insurer sees that an applicant’s Facebook profile indicates that she is a student pilot, the insurer cannot use that knowledge to increase her premiums, even though she might be considered a high risk.
Solution: Vehicle telematics
Big data type: M2M data
Discipline: Information lifecycle management
An insurer instituted a pilot program that offered lower rates to policyholders in exchange for the ability to put on-board sensors on motor vehicles. These sensors gathered telematics data to monitor the driving behavior of policyholders. Overwhelmed with a large amount of data, the insurer had to establish a policy regarding the retention period for telematics data.
Solution: Risk management
Big data type: Web and social media (web content)
Discipline: Master data integration
Risk management departments need to update their customer hierarchies, all of which depend on the most current financial information. For example, when Tata Motors acquired Jaguar, the risk management department had to update the risk hierarchy for Tata Motors to also include any exposure to Jaguar. In another example, a bank developed an economic hierarchy to aggregate its overall exposure to a car manufacturer, its tier 1 and tier 2 suppliers, and the employees of the manufacturer and its suppliers. The risk management department could update its economic hierarchy manually in the event of consolidation between suppliers. Alternatively, it could use big data technologies to comb through unstructured financial information, such as U.S. Securities and Exchange Commission 10-K and 10-Q filings, to dynamically update changes in company ownership structures within its MDM hierarchies.
Solution: Credit, collections
Big data type: Web and social media
Discipline: Privacy
Banks must follow regulations such as the United States Fair Credit Reporting Act when using social media for credit decisions. In addition, collections departments must adhere to regulations such as the United States Fair Debt Collection Practices Act, which are designed to prevent collectors from harassing debtors or infringing upon their privacy, including within social media.
Solution: Preventive maintenance
Big data type: M2M data
Disciplines: Data quality, information lifecycle management, business process integration, master data integration, metadata
Sensors on a modern train record more than 1,000 different types of mechanical and electrical events. These include operational events such as “opening door” or “train is braking,” warning events such as “line voltage frequency is out of range” or “compression is low in compressor X,” and failure events such as “pantograph is out of order” or “inverter lockout.” The preventive maintenance team uses predictive models to identify events that are highly correlated with preceding events. Consider an example where failure event 1245 is preceded by warning event 2389 in 90 percent of cases. In this example, the operations team must issue a work order for preventive maintenance whenever warning event 2389 is logged into the system. Metadata and master data also matter. If the railroad has trains in its fleet from different manufacturers, sensors on different trains might generate different numerical codes for the same event. If a particular part failed on one train, the operations department might want to inspect similar parts on other trains, which would be difficult if the same part has different names across trains. Retention of sensor data that is driven by safety regulations is another consideration.
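The two ideas in this use case, normalizing manufacturer-specific codes and measuring how often a failure follows a warning, can be sketched as below. The manufacturer names, vendor codes, and the `window` parameter are invented for illustration; only the canonical event numbers 2389 and 1245 come from the example above. This is a simple counting approach, not a description of the predictive models the team actually used.

```python
from typing import List, Optional, Tuple

# Hypothetical mapping: (manufacturer, vendor-specific code) -> canonical event code.
CANONICAL = {
    ("vendor_a", "W-17"): 2389,
    ("vendor_b", "0x3F"): 2389,
    ("vendor_a", "F-02"): 1245,
    ("vendor_b", "0x9A"): 1245,
}


def normalize(manufacturer: str, code: str) -> Optional[int]:
    """Translate a vendor-specific code to the fleet-wide code, or None if unmapped."""
    return CANONICAL.get((manufacturer, code))


def precedence_rate(events: List[Tuple[float, int]],
                    warning: int, failure: int, window: float) -> float:
    """Fraction of `failure` events that had a `warning` within `window` time units before."""
    warnings = [t for t, code in events if code == warning]
    failures = [t for t, code in events if code == failure]
    if not failures:
        return 0.0
    preceded = sum(1 for f in failures
                   if any(0 < f - w <= window for w in warnings))
    return preceded / len(failures)
```

Normalizing first matters: without a shared code, the same warning from two manufacturers would be counted as two unrelated events and the correlation would be understated.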
Solution: Call monitoring
Big data type: Human-generated data
Discipline: Privacy
Customer service departments analyze voice recordings to improve operational efficiency and to support agent training. Before using this data, customer service departments should mask the portions of the voice recordings that contain sensitive information such as social security number, account number, name, and address.
Solution: Log analytics
Big data type: M2M data
Discipline: Metadata
IT departments are turning to big data to analyze application logs for slivers of insight that can improve system performance. Because application vendors’ log files are in different formats, they need to be standardized before IT departments can use them.
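Standardizing log formats might be approached as in the sketch below, which parses two made-up vendor formats into one common record. Both line formats, the regular expressions, and the output field names are assumptions for the example; real application logs vary far more widely.

```python
import re
from datetime import datetime
from typing import Optional

# Two hypothetical vendor formats:
#   A: "2013-05-30 12:30:00 [warn] disk latency high"
#   B: "ERROR|30/05/2013 12:30:00|timeout"
FORMAT_A = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[(?P<level>\w+)\] (?P<msg>.*)$")
FORMAT_B = re.compile(
    r"^(?P<level>\w+)\|(?P<ts>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\|(?P<msg>.*)$")

PARSERS = (
    (FORMAT_A, "%Y-%m-%d %H:%M:%S"),
    (FORMAT_B, "%d/%m/%Y %H:%M:%S"),
)


def standardize(line: str) -> Optional[dict]:
    """Parse a log line in either format into a common record, or None if unrecognized."""
    for pattern, ts_format in PARSERS:
        match = pattern.match(line)
        if match:
            return {
                "timestamp": datetime.strptime(match["ts"], ts_format).isoformat(),
                "level": match["level"].upper(),
                "message": match["msg"],
            }
    return None
```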
Solution: Sentiment analysis
Big data type: Web and social media
Disciplines: Master data integration, data quality, privacy
Marketing departments use Twitter feeds to conduct sentiment analysis that helps an organization determine what users are saying about the company and its products or services. However, this analysis raises several governance issues. For example, the analytics team needs to determine whether references to “@Acme” and “Acme” refer to “Acme Corporation.” Integrating sentiment analysis with a customer’s profile can also be challenging because, in addition to privacy issues, the Twitter handle reveals the user’s name in only 50 to 60 percent of cases. Finally, marketing might need to answer the following question: “Do we really believe that Twitter sentiment analysis is representative if users are younger and more affluent than our usual customers?”
Solution: Network analytics
Big data type: M2M data
Discipline: Metadata
Security Information and Event Management (SIEM) tools aggregate log data from systems, applications, network elements, and security devices across the enterprise. It is highly likely that the log files from two network elements will refer to the same event using different codes. Security professionals need to normalize these event codes before using SIEM analytics.
Organizations will be successful in governing their big data if they adopt a framework that covers the appropriate types of big data, the information governance disciplines, and the specific use cases for their industry and function.