Why Log Analytics is a Great—and Awful—Place to Start with Big Data
First, let’s define what we mean by log analytics. The most common log analytics use case involves using Apache Hadoop to process machine-generated logs, which are typically clickstreams from Web applications or servers that support Web applications. Log analytics requires an ability to ingest a large amount of semi-structured information and then boil that information down into a more consumable data set that summarizes the important bits from the interactions. Log processing (for ad placement) is a core use case that Hadoop was invented to help with—so it’s no surprise that it functions well in this scenario.
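To make the "ingest, then boil down" step concrete, here is a minimal sketch in plain Python of the map-and-reduce shape such a job usually takes. It is an illustration only, not anyone's production pipeline: the Common Log Format layout, the sample lines, and the function names (`map_line`, `reduce_counts`, `summarize`) are all assumptions made for this example, and the metric (successful page views per path) is just one possible summary.

```python
import re
from collections import Counter

# Assumed log layout (Common Log Format); the article does not specify one.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)


def map_line(line):
    """Map step: emit (path, 1) for each parseable, successful request."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return []  # malformed lines are skipped, not repaired
    if not match.group("status").startswith("2"):
        return []  # ignore redirects, client errors, and server errors
    return [(match.group("path"), 1)]


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts per key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def summarize(lines):
    """Run the map step over every line, then reduce to one small summary."""
    pairs = []
    for line in lines:
        pairs.extend(map_line(line))
    return reduce_counts(pairs)


# Hypothetical sample data, invented for this sketch.
SAMPLE_LINES = [
    '10.0.0.1 - - [01/Jan/2024:00:00:01 +0000] "GET /home HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jan/2024:00:00:02 +0000] "GET /pricing HTTP/1.1" 200 1024',
    '10.0.0.1 - - [01/Jan/2024:00:00:03 +0000] "GET /home HTTP/1.1" 200 512',
    '10.0.0.3 - - [01/Jan/2024:00:00:04 +0000] "GET /missing HTTP/1.1" 404 0',
    'this line is not a valid log entry',
]

if __name__ == "__main__":
    print(summarize(SAMPLE_LINES))
```

The same split into a per-line map function and a per-key reduce function could, in principle, be adapted to Hadoop Streaming, which runs mapper and reducer programs over standard input and output. Note how little the sketch asks of the data: one source, one repeating format, and bad lines simply dropped.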
Google, Yahoo, and a host of other Internet properties operate using business models that depend heavily on doing this and doing it well. Most companies, however, experience a delay (measured not in hours or days, but in weeks) between when a Web event happens and when they learn about it from their clickstream or Web log analysis. Because the bar is set so low, it’s really not that hard to make things substantially better.
In addition, because most firms won’t decommission their existing log analytics system (which is often a third party that specializes in Web click analytics), log analysis using Hadoop can be a very low-risk way to get started with big data. It’s not mission-critical. Log analytics is not a use case where people will die or large amounts of capital are put at risk if you get it wrong.
Hadoop vendors find it attractive to promote a log processing use case to more traditional enterprises that are just getting started with log analytics, because it relies on non-critical data and, frankly, is not that hard to do. There is a low cost of failure and experimentation, it can be done in isolation from other production applications and job flows, and it can be done using the command-line tools that come with generic Hadoop distributions. You don’t need to expose your experimentation or methods to the rest of the enterprise.
On the other hand…
Here’s the catch: Using Hadoop successfully to analyze log data is not a predictor of success in a typical enterprise scenario. The factors that make Hadoop a good fit for log analytics can mask what is required for real enterprise use and success. Log data is fairly structured. And while there may be a lot of it, it simply repeats—which is exactly why it is not an adequate test ground for data from a variety of sources and in a variety of structures.
The log analytics projects that I see most often are both static and non-predictive, so they are really log ETL jobs rather than analytics. There are no information lineage issues to deal with, and often there is a single source of information, so it is assumed to be valid and you get a “pass” on data quality. Further, there typically aren’t governance issues that need to be considered (or governance measures aren’t enforced even when they are considered). There are generally no SLAs to meet, and jobs frequently run overnight, so whether the job finishes at four o’clock or six o’clock in the morning doesn’t really have any impact on the use case.
These jobs require very little, if any, visualization—often because you’re just crunching the data and letting another system or manual job come and get it. There’s no need to test how easily Hadoop can be accessed by non-developers. There’s no connection with the rest of the business intelligence and reporting systems in the company. In other words, these projects are not a representative test of real-world success. They don’t use real-world data flows, and they usually don’t support a second and third use case on the same platform with the same skills.
To be clear, I am not saying that log analytics is not a valid use case. Nor am I arguing that it is not a good way to learn about Hadoop. What I am saying is this: Do not assume that initial success with log analytics using Hadoop will translate to enterprise success in a broader deployment. Do not confuse enterprise readiness with success at what is essentially an alternative way of doing an isolated, single-scope ETL job that is free from data quality and SLA requirements; it is not a good predictor of what will work in your typical enterprise production environments.
What do you think? Is log analytics a good starting point for working with big data, or is it a lousy way to get going? Let me know your thoughts in the comments.