A Framework that Focuses on the “Data” in Big Data Governance

Big data types, information governance disciplines, industries, and functions

Big data governance is part of a broader information governance program that formulates policy relating to the optimization, privacy, and monetization of big data by aligning the objectives of multiple functions. However, big data governance is meaningless without an understanding of the underlying data types.

Figure 1. A three-dimensional framework for big data governance

 

This article provides a framework for big data governance. As shown in Figure 1, the framework consists of three dimensions:

  • Big data types
    Big data can be classified into five types: web and social media, machine-to-machine (M2M), big transaction data, biometrics, and human-generated.
  • Information governance disciplines
    The traditional disciplines of information governance (organization, metadata, privacy, data quality, business process integration, master data integration, and information lifecycle management) also apply to big data. For example, sensor data needs to be integrated into a preventive maintenance process. However, if sensors from different machines generate inconsistent event codes, it will be difficult to streamline the maintenance process.
  • Industries and functions
    Big data analytics are driven by use cases that are specific to a given industry or function such as marketing, customer service, information security, or information technology.

As mentioned above, big data falls into five categories:

  1. Web and social media data includes clickstream and interaction data from social media such as Facebook, Twitter, LinkedIn, and blogs.
  2. Machine-to-machine data includes readings from sensors, meters, and other devices as part of the so-called “Internet of things.”
  3. Big transaction data includes healthcare claims, telecommunications call detail records (CDRs), and utility billing records that are increasingly available in semi-structured and unstructured formats.
  4. Biometric data includes fingerprints, genetics, handwriting, retinal scans, and similar types of data.
  5. Human-generated data includes vast quantities of unstructured and semi-structured data such as call center agents’ notes, voice recordings, email, paper documents, surveys, and electronic medical records.

A big data framework looks different depending on industry and function.

 

Healthcare providers

Solution:          Patient monitoring

Big data type:  M2M data

Disciplines:     Data quality, information lifecycle management, privacy

A hospital leveraged streaming analytics technologies to monitor the health of newborn babies in the neonatal intensive care unit. Using these technologies, the hospital was able to predict the onset of disease a full 24 hours before any symptoms appeared. These technologies depended on large volumes of time series data. However, this data was sometimes missing when a patient moved, which caused the lead to disengage and stop providing readings. In these situations, the streaming platform applied linear and polynomial regressions to historical readings to fill in the gaps in the time series data. The hospital also tagged all time series data that had been modified by software algorithms. In case of a lawsuit or medical inquiry, the hospital felt that it had to produce both the original and modified readings. In addition, the hospital established policies around safeguarding protected health information.
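
The gap-filling and tagging approach described above can be sketched roughly as follows. This is a minimal illustration, not the hospital's actual implementation: it uses only a linear least-squares fit (no polynomial terms), and the names Reading, fit_line, fill_gaps, and the window parameter are assumptions made for the example. Each record keeps the original reading next to the value used for analytics, with a flag marking values produced by software.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Reading:
    t: float                   # time of the reading (hypothetical unit: seconds)
    original: Optional[float]  # raw sensor value; None when the lead was disengaged
    value: Optional[float]     # value used for analytics (raw or imputed)
    imputed: bool = False      # True when software filled in the value


def fit_line(xs: List[float], ys: List[float]):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x


def fill_gaps(times, raw, window=5):
    """Fill missing readings from a regression on the preceding observed readings."""
    out = []
    for t, v in zip(times, raw):
        if v is not None:
            out.append(Reading(t, v, v))
            continue
        # Fit only on readings that really came from the sensor.
        history = [r for r in out if r.original is not None][-window:]
        if len(history) < 2:
            out.append(Reading(t, None, None))  # too little history to extrapolate
            continue
        slope, intercept = fit_line([r.t for r in history],
                                    [r.original for r in history])
        out.append(Reading(t, None, slope * t + intercept, imputed=True))
    return out


# Made-up heart-rate style readings with one gap at t = 3.
series = fill_gaps([0, 1, 2, 3, 4], [120.0, 122.0, 124.0, None, 128.0])
for r in series:
    print(r)
```

Because the original and imputed values are stored side by side, both can be produced later, which matches the audit requirement the hospital identified.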

 

Solution:          Predictive modeling based on electronic medical records

Big data type: Human-generated data

Discipline:       Data quality

The analytics department at a hospital built a predictive model based on 150 variables and 20,000 patient encounters to determine the likelihood that a patient would be readmitted within 30 days of treatment for congestive heart failure. In one example of the predictive model’s effectiveness, the analytics team identified the patient’s smoking status as a critical variable. At first, only 25 percent of the structured data around smoking status was populated with binary yes/no answers. However, the analytics team increased the population rate for smoking status to 85 percent of the encounters by using content analytics based on electronic medical records containing doctor’s notes, discharge summaries, and patient physicals—enabling the analytics team to improve the quality of sparsely populated structured data by using unstructured data sources.
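
As a rough illustration of how unstructured notes can raise the population rate of a structured field, the sketch below fills a missing smoking flag from free text. It assumes a handful of hypothetical phrase patterns and a made-up record layout (smoker, note, smoker_source); real content analytics on clinical text would need far richer language handling than these regular expressions.

```python
import re
from typing import Optional

# Hypothetical phrase patterns, for illustration only.
NON_SMOKER = re.compile(r"\b(non-?smoker|never smoked|denies (smoking|tobacco))\b", re.I)
SMOKER = re.compile(r"\b(current smoker|smokes|pack[- ]a[- ]day|tobacco use)\b", re.I)


def smoking_from_note(note: str) -> Optional[bool]:
    """Return True/False when the note states a smoking status, else None."""
    if NON_SMOKER.search(note):  # check negations first ("denies tobacco use")
        return False
    if SMOKER.search(note):
        return True
    return None


def fill_smoking_status(encounters):
    """Populate missing structured values and record where they came from."""
    for enc in encounters:
        if enc["smoker"] is None:
            inferred = smoking_from_note(enc["note"])
            if inferred is not None:
                enc["smoker"] = inferred
                enc["smoker_source"] = "note"
    return encounters


def population_rate(encounters):
    """Share of encounters with a populated smoking status."""
    return sum(e["smoker"] is not None for e in encounters) / len(encounters)


# Made-up encounters: only the first has a structured value.
encounters = [
    {"id": 1, "smoker": True, "note": ""},
    {"id": 2, "smoker": None, "note": "Patient smokes one pack a day."},
    {"id": 3, "smoker": None, "note": "Denies tobacco use."},
    {"id": 4, "smoker": None, "note": "Follow-up in two weeks."},
]
before = population_rate(encounters)
fill_smoking_status(encounters)
after = population_rate(encounters)
print(before, after)
```

Recording the source of each inferred value keeps it distinguishable from values that were captured as structured data in the first place.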

 

Health plans

Solution:          Claims analytics

Big data type: Big transaction data

Discipline:       Data quality

A large health plan processes over 500 million claims per year, with each claims record consisting of 600 to 1,000 attributes. The plan uses predictive analytics to determine whether certain proactive measures are required for a small subset of members. However, the business intelligence team found that physicians were using inconsistent procedure codes to submit claims, which limited the effectiveness of the predictive analytics. The business intelligence team also queried the text within claims documents. For example, the team searched for terms such as “chronic congestion” and “blood-sugar monitoring” to identify members who might be candidates for disease management programs for asthma and diabetes, respectively.

 

Utilities

Solution:          Smart meters

Big data type: M2M data

Disciplines:     Privacy, information lifecycle management

Several utilities are rolling out smart meters to measure the consumption of water, gas, and electricity at regular intervals of one hour or less. These smart meters generate copious amounts of interval data that need to be governed appropriately. Utilities must safeguard the privacy of this interval data because it can potentially reveal a subscriber’s household activities as well as when a homeowner might be away. In addition, utilities need to establish policies for the archival and deletion of interval data to reduce storage costs.
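
An archival and deletion policy for interval data could be expressed along these lines. The retention periods below are placeholders chosen for the example, not values from any utility or regulation, and retention_action is a hypothetical helper name.

```python
from datetime import datetime, timedelta

# Hypothetical retention periods; real values depend on regulation and business need.
ARCHIVE_AFTER = timedelta(days=90)
DELETE_AFTER = timedelta(days=365 * 3)


def retention_action(reading_time: datetime, now: datetime) -> str:
    """Decide what to do with one interval reading based on its age."""
    age = now - reading_time
    if age >= DELETE_AFTER:
        return "delete"
    if age >= ARCHIVE_AFTER:
        return "archive"
    return "keep"


now = datetime(2024, 1, 1)
sample_times = (datetime(2023, 12, 1), datetime(2023, 6, 1), datetime(2020, 1, 1))
actions = [retention_action(t, now) for t in sample_times]
print(actions)
```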

 

Retail

Solution:          Facebook loyalty app

Big data type: Web and social media

Disciplines:     Privacy, master data integration, organization

A retailer’s marketing department might want to use master data on customers, products, employees, and store locations to enrich its Facebook app. The success of the Facebook app depends on a strong foundation of master data management (MDM) and policies around social media governance. For example, the retailer would need to adhere to the Facebook Platform Policies by not using data on a customer’s friends outside of the context of the app. In addition, marketing and social media stewards need to agree on a consistent set of identifiers to link a customer’s Facebook profile with his or her MDM record. Finally, the retailer needs to establish a robust product hierarchy to enable product comparisons. For instance, the retailer would need to know that a customer who purchased a “Whirlpool GX5FHDXVY” already has a product in the “refrigerator” hierarchy.

 

Solution:          Personalized messaging based on facial recognition and social media

Big data type: Web and social media, biometrics

Disciplines:     Privacy, business process integration

A March 2012 report from the U.S. Federal Trade Commission details how retailers could potentially use facial recognition technology in combination with a photo from social media to make personalized offers to customers based on their buying behavior and location. While this information could have a tremendous impact on retailers’ loyalty programs, it would also have serious privacy ramifications. Retailers would need to make the appropriate privacy disclosures before implementing these applications.

 

Telecommunications

Solution:          Customer churn analytics

Big data type: Web and social media, big transaction data

Disciplines:     Privacy, master data integration

Telecommunications operators build detailed customer churn models that include social media and big transaction data such as CDRs. However, the overall value of the churn models also depends on the quality of traditional attributes of customer master data such as date of birth, gender, location, and income. A large operator wanted to implement a predictive analytics strategy around churn management. Analyzing subscribers’ calling patterns has proven to be an effective way to predict churn, so the operator decided that it would outsource its churn analytics to an overseas vendor. Because these CDRs had to be shipped to the vendor each day, there was significant concern over safeguarding the privacy of customer data. After the appropriate deliberation, the operator decided to mask sensitive data such as subscriber name because the calling and receiving telephone numbers were the primary fields of value for churn analytics.
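
The masking step might look something like the sketch below before CDRs leave the operator. The field names and the fixed mask token are assumptions for illustration; the operator's actual record layout and masking technique are not described in the example above.

```python
# Hypothetical field names; a real CDR layout will differ.
SENSITIVE_FIELDS = {"subscriber_name"}
MASK = "***"


def mask_cdr(record: dict) -> dict:
    """Return a copy of the CDR with sensitive fields masked.

    Calling and receiving numbers are left intact because they are the
    fields the churn analytics relies on.
    """
    masked = dict(record)
    for field in SENSITIVE_FIELDS:
        if field in masked:
            masked[field] = MASK
    return masked


cdr = {
    "subscriber_name": "Jane Doe",
    "calling_number": "15550100",
    "receiving_number": "15550199",
    "duration_s": 42,
}
masked = mask_cdr(cdr)
print(masked)
```

Returning a copy leaves the operator's own record untouched, so only the outbound feed to the vendor carries masked values.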

 

Insurance

Solution:          Claims investigation, underwriting

Big data type: Web and social media

Disciplines:     Privacy, business process integration

Many insurance carriers now use social media to investigate claims. However, most regulators still do not permit insurers to use social media to set policy rates during the underwriting process. For example, if a life insurer sees that an applicant’s Facebook profile indicates that she is a student pilot, the insurer cannot use that knowledge to increase her premiums, even though she might be considered a high risk.

 

Solution:          Vehicle telematics

Big data type: M2M data

Discipline:       Information lifecycle management

An insurer instituted a pilot program that offered lower rates to policyholders in exchange for the ability to put on-board sensors on motor vehicles. These sensors gathered telematics data to monitor the driving behavior of policyholders. Overwhelmed with a large amount of data, the insurer had to establish a policy regarding the retention period for telematics data.

 

Banking

Solution:          Risk management

Big data type: Web and social media (web content)

Discipline:       Master data integration

Risk management departments need to update their customer hierarchies, all of which depend on the most current financial information. For example, when Tata Motors acquired Jaguar, the risk management department had to update the risk hierarchy for Tata Motors to also include any exposure to Jaguar. In another example, a bank developed an economic hierarchy to aggregate its overall exposure to a car manufacturer, its tier 1 and tier 2 suppliers, and the employees of the manufacturer and its suppliers. The risk management department could update its economic hierarchy in the event of consolidation between suppliers, or use big data technologies to comb through unstructured financial information such as U.S. Securities and Exchange Commission 10-K and 10-Q filings to dynamically reflect changes in company ownership structures within its MDM hierarchies.
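
To make the hierarchy update concrete, here is a minimal sketch of rolling exposure up a parent/child hierarchy. The RiskHierarchy class and the exposure amounts are invented for the example, and the hierarchy is assumed to contain no cycles.

```python
class RiskHierarchy:
    """Toy risk hierarchy: direct exposure per entity plus parent/child links."""

    def __init__(self):
        self.children = {}  # parent name -> set of child names
        self.exposure = {}  # entity name -> direct exposure amount

    def add_exposure(self, entity: str, amount: float) -> None:
        self.exposure[entity] = self.exposure.get(entity, 0.0) + amount

    def add_subsidiary(self, parent: str, child: str) -> None:
        self.children.setdefault(parent, set()).add(child)

    def total_exposure(self, entity: str) -> float:
        """Direct exposure plus the exposure of everything beneath the entity.

        Assumes the hierarchy is acyclic.
        """
        direct = self.exposure.get(entity, 0.0)
        return direct + sum(self.total_exposure(c)
                            for c in self.children.get(entity, ()))


h = RiskHierarchy()
h.add_exposure("Tata Motors", 100.0)  # made-up amounts
h.add_exposure("Jaguar", 40.0)
before = h.total_exposure("Tata Motors")

# After the acquisition, Jaguar is placed under Tata Motors.
h.add_subsidiary("Tata Motors", "Jaguar")
after = h.total_exposure("Tata Motors")
print(before, after)
```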

 

Solution:          Credit, collections

Big data type: Web and social media

Discipline:       Privacy

Banks need to follow regulations such as the United States Fair Credit Reporting Act when using social media for credit decisions. In addition, collections departments must adhere to regulations such as the United States Fair Debt Collection Practices Act, which are designed to prevent collectors from harassing debtors or infringing upon their privacy, including within social media.

 

Railroads

Solution:          Preventive maintenance

Big data type: M2M data

Disciplines:     Data quality, information lifecycle management, business process integration, master data integration, metadata

Sensors on a modern train record more than 1,000 different types of mechanical and electrical events. These include operational events such as “opening door” or “train is braking,” warning events such as “line voltage frequency is out of range” or “compression is low in compressor X,” and failure events such as “pantograph is out of order” or “inverter lockout.” The preventive maintenance team uses predictive models to identify events that are highly correlated with preceding events. Consider an example where failure event 1245 is preceded by warning event 2389 in 90 percent of cases. In this example, the operations team should issue a work order for preventive maintenance whenever warning event 2389 is logged into the system. If the railroad has trains in its fleet from different manufacturers, sensors on different trains might generate different numerical codes for the same event. If a particular part failed on one train, the operations department might want to inspect similar parts on other trains, which would be difficult if the same part has different names across trains. The retention of sensor data, which is driven by safety regulations, is another consideration.
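
One simple way to estimate how often a failure is preceded by a given warning is sketched below. It is an illustration rather than the railroad's predictive model: the function name preceded_rate, the time window, and the sample events are assumptions, and only the event codes 1245 and 2389 come from the example above.

```python
def preceded_rate(events, failure_code, warning_code, window):
    """Fraction of failure events with a warning event in the preceding window.

    events: iterable of (timestamp, code) pairs.
    window: how far back (in timestamp units) a warning still counts.
    """
    failures = [t for t, code in events if code == failure_code]
    warnings = [t for t, code in events if code == warning_code]
    if not failures:
        return 0.0
    hits = sum(
        any(0 < t_f - t_w <= window for t_w in warnings)
        for t_f in failures
    )
    return hits / len(failures)


# Made-up event log: (timestamp in minutes, event code).
events = [
    (10, 2389), (12, 1245),  # warning shortly before failure
    (30, 2389), (33, 1245),  # warning shortly before failure
    (60, 1245),              # failure with no recent warning
    (70, 2389),
]
rate = preceded_rate(events, failure_code=1245, warning_code=2389, window=5)
print(rate)
```

A calculation like this only works across the fleet if the same event carries the same code on every train, which is why inconsistent codes from different manufacturers need to be reconciled first.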

 

Customer service

Solution:          Call monitoring

Big data type:  Human-generated

Discipline:       Privacy

Customer service departments analyze voice recordings to improve operational efficiency and to support agent training. Before using this data, customer service departments should mask the portions of the voice recordings that contain sensitive information such as social security number, account number, name, and address.

 

Information technology

Solution:          Log analytics

Big data type: M2M data

Discipline:       Metadata

IT departments are turning to big data to analyze application logs for slivers of insight that can improve system performance. Because application vendors’ log files are in different formats, they need to be standardized before IT departments can use them.
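
Standardizing log formats can be sketched as parsing each vendor's layout into one common record. The two line formats, the level mapping, and the normalize function below are invented for the example and do not correspond to any particular vendor's logs.

```python
import re
from datetime import datetime

# Two hypothetical vendor formats:
#   A: "2024-01-05 10:15:00 ERROR Disk full"
#   B: "level=warn ts=2024-01-05T10:15:02 msg=Slow query"
FORMAT_A = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\w+) (.*)$")
FORMAT_B = re.compile(r"^level=(\w+) ts=(\S+) msg=(.*)$")

# Map vendor-specific level names onto one vocabulary.
LEVELS = {"error": "ERROR", "err": "ERROR",
          "warn": "WARNING", "warning": "WARNING",
          "info": "INFO"}


def normalize(line: str):
    """Parse a log line in either format into a common record, or None."""
    m = FORMAT_A.match(line)
    if m:
        ts, level, msg = m.groups()
    else:
        m = FORMAT_B.match(line)
        if not m:
            return None  # unknown format; leave for manual review
        level, ts, msg = m.groups()
    return {
        "timestamp": datetime.fromisoformat(ts),
        "level": LEVELS.get(level.lower(), level.upper()),
        "message": msg,
    }


rec_a = normalize("2024-01-05 10:15:00 ERROR Disk full")
rec_b = normalize("level=warn ts=2024-01-05T10:15:02 msg=Slow query")
print(rec_a)
print(rec_b)
```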

 

Marketing

Solution:          Sentiment analysis

Big data type: Web and social media

Disciplines:     Master data integration, data quality, privacy

Marketing departments use Twitter feeds to conduct sentiment analysis that helps an organization determine what users are saying about the company and its products or services. For example, the analytics team needs to determine whether references to “@Acme” and “Acme” refer to “Acme Corporation.” Integration of sentiment analysis with a customer’s profile can also be challenging, because in addition to privacy issues, the Twitter handle reveals the user name only in 50 to 60 percent of cases. Plus, marketing might need to answer the following question: “Do we really believe that Twitter sentiment analysis is representative if users are younger and more affluent than our usual customers?”

 

Information security

Solution:          Network analytics

Big data type: M2M data

Discipline:       Metadata

Security Information and Event Management (SIEM) tools aggregate log data from systems, applications, network elements, and security devices across the enterprise. It is highly likely that the log files from two network elements will refer to the same event using different codes. Security professionals need to normalize these event codes before using SIEM analytics.
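
Event code normalization can be sketched as a lookup from (source, vendor code) to a shared vocabulary. The source names, codes, and canonical labels below are hypothetical, and this is not how any specific SIEM product implements normalization.

```python
from collections import Counter

# Hypothetical sources and codes, for illustration only.
CODE_MAP = {
    ("firewall_a", "4001"): "AUTH_FAILURE",
    ("router_b", "LOGIN_FAIL"): "AUTH_FAILURE",
    ("firewall_a", "4002"): "PORT_SCAN",
}


def normalize_event(source: str, code: str) -> str:
    """Translate a vendor-specific event code into the shared vocabulary."""
    return CODE_MAP.get((source, code), "UNMAPPED")


def count_events(events):
    """Count events by normalized code so the same event is tallied once."""
    return Counter(normalize_event(source, code) for source, code in events)


events = [
    ("firewall_a", "4001"),
    ("router_b", "LOGIN_FAIL"),
    ("router_b", "X9"),  # no mapping yet
]
counts = count_events(events)
print(counts)
```

Surfacing unmapped codes explicitly, rather than dropping them, gives the metadata stewards a worklist of codes that still need a canonical definition.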

 

Conclusion

Organizations will be successful in governing their big data if they adopt a framework that covers the appropriate types of big data, the information governance disciplines, and the specific use cases for their industry and function.

Sunil Soares

Sunil Soares is the founder and managing partner of Information Asset, LLC, a consulting firm that specializes in helping organizations build out their information governance programs. Prior to this role, Sunil was the Director of Information Governance at IBM, and worked with clients across six continents and multiple industries.

Sunil has published a book called The IBM Data Governance Unified Process that details the fourteen steps and almost one hundred sub-steps to implement an information governance program. The book is currently in its second print and has also been translated into Chinese.

Sunil’s second book, Selling Information Governance to the Business: Best Practices by Industry and Job Function, reviews the best way to approach information governance by industry and function.

His third book, Big Data Governance, will review the importance of information governance for different types of big data such as social media, machine-to-machine, big transaction data, biometrics, and human generated data. Sunil has also worked at the Financial Services Strategy Consulting Practice of Booz Allen & Hamilton in New York. He lives in New Jersey and holds an MBA in Finance and Marketing from the University of Chicago Booth School of Business.

  • Luis

    I do research on ports and logistics at a Korean university, and I wonder whether you have any ideas on how to design a framework for big data analytics for the port and logistics industry.

  • http://www.information-asset.com Sunil Soares

    Thanks Sumanda, Sorry it took so long to respond to you. I agree with your comments about the Z-axis. That was the intent of the graphic. You want to pick your industry/function (Y axis), then your big data types (X axis), and, finally, the information governance disciplines (Z axis)…Sunil

  • http://www.perfectsearchcorp.com Scott Livingston

    A well-written, thoughtful article on both the challenges and opportunities in the “Big Data” world we are now living in.

    Key takeaways: More and more of the data being created is unstructured data. Being able to quickly search that unstructured data is critical for a company’s competitiveness.

    Therefore, finding a search company that can effectively and affordably index and then query data is a mission-critical activity.

    • http://www.information-asset.com Sunil Soares

      Scott, sorry it took me so long to respond. Thanks for the comments…Sunil

  • http://sumandabasu.wordpress.com Sumanda Basu

    Sunil, this is a good article. I have a comment on Figure 1. I think the Information Governance axis should be on the third dimension (Z axis) of the cube and should be renamed Information Governance Discipline. I think it would be more useful to have capability (within a context) on the Y axis instead of industries/functions.

    Sumanda