Why Log Analytics is a Great—and Awful—Place to Start with Big Data

The pros and cons of learning Hadoop using structured data

First, let’s define what we mean by log analytics. The most common log analytics use case involves using Apache Hadoop to process machine-generated logs, which are typically clickstreams from Web applications or servers that support Web applications. Log analytics requires an ability to ingest a large amount of semi-structured information and then boil that information down into a more consumable data set that summarizes the important bits from the interactions. Log processing (for ad placement) is a core use case that Hadoop was invented to help with—so it’s no surprise that it functions well in this scenario.
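To make the "boil it down" step concrete, here is a minimal sketch in Python of the kind of summarization such a job performs: parsing semi-structured Web-server log lines and reducing them to per-page hit counts. The log format (Apache-style combined log) and field layout are illustrative assumptions, not anything the original projects necessarily used.

```python
import re
from collections import Counter

# Hypothetical clickstream lines in a common Web-server log format:
# client IP, timestamp, request line, status code, bytes sent.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) \d+'
)

def summarize(lines):
    """Boil raw log lines down to hits per path, skipping unparsable lines."""
    hits = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m:
            hits[m.group("path")] += 1
    return hits

sample = [
    '10.0.0.1 - - [01/Jan/2024:00:00:01 +0000] "GET /home HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jan/2024:00:00:02 +0000] "GET /cart HTTP/1.1" 200 734',
    '10.0.0.1 - - [01/Jan/2024:00:00:03 +0000] "GET /home HTTP/1.1" 304 0',
    'corrupted line',
]
print(summarize(sample))  # Counter({'/home': 2, '/cart': 1})
```

The point of the sketch is only to show the shape of the work: a large, repetitive, semi-structured input reduced to a small, consumable summary.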

Google, Yahoo, and a host of other Internet properties operate using business models that depend heavily on doing this and doing it well. Most companies, however, experience a delay (measured not in hours or days, but in weeks) between when a Web event happens and when they learn about it from their click or Weblog data. So it’s really not that hard to make things substantially better, because the bar is set so low.

In addition, because most firms won’t decommission their existing log analytics system (which is often a third party that specializes in Web click analytics), log analysis using Hadoop can be a very low-risk way to get started with big data. It’s not mission-critical. Log analytics is not a use case where people will die or you put large amounts of capital at risk if you get it wrong.

For more traditional enterprises that are just getting started with log analytics, promoting a log processing use case is attractive to Hadoop vendors because it relies on non-critical data and, frankly, is not that hard to do. There is a low cost of failure and experimentation, it can be done in isolation from other production applications and job flows, and it can be done using the command line tools that come with generic Hadoop distributions. You don’t need to expose your experimentation or methods to the rest of the enterprise.

On the other hand…

Here’s the catch: Using Hadoop successfully to analyze log data is not a predictor of success in a typical enterprise scenario. The factors that make Hadoop a good fit for log analytics can mask what is required for real enterprise use and success. Log data is fairly structured. And while there may be a lot of it, it simply repeats—which is exactly why it is not an adequate test ground for data from a variety of sources and in a variety of structures.

The log analytics projects that I see most often are both static and non-predictive, so they are really log ETL jobs rather than analytics. There are no information lineage issues to deal with, and often there is a single source of information, so it is assumed to be valid and you get a “pass” on data quality. Further, there typically aren’t governance issues that need to be considered (or governance measures aren’t enforced even when they are considered). There are generally no SLAs to meet, and jobs frequently run overnight—so whether the job finishes at four o’clock or six o’clock in the morning doesn’t really have any impact on the use case.
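The "log ETL rather than analytics" pattern is, at its core, just a grouped aggregation. In Hadoop Streaming terms it amounts to a mapper that emits (key, 1) pairs and a reducer that sums them, which the following sketch simulates in plain Python. The field position assumes the Apache combined log format and is an illustrative assumption.

```python
from itertools import groupby

def mapper(lines):
    # Emit (request path, 1) per line; real jobs validate input more carefully.
    for line in lines:
        fields = line.split()
        if len(fields) > 6:
            yield fields[6], 1  # request path in combined log format

def reducer(pairs):
    # Hadoop delivers mapper output sorted by key; sorted() simulates the shuffle.
    for path, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield path, sum(count for _, count in group)

logs = [
    '10.0.0.1 - - [01/Jan/2024:02:00:00 +0000] "GET /home HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jan/2024:02:00:05 +0000] "GET /cart HTTP/1.1" 200 734',
    '10.0.0.3 - - [01/Jan/2024:02:00:09 +0000] "GET /home HTTP/1.1" 200 512',
]
print(dict(reducer(mapper(logs))))  # {'/cart': 1, '/home': 2}
```

Notice what’s absent: no lineage tracking, no quality checks beyond a length guard, no deadline. That absence is exactly why a job like this tells you little about enterprise readiness.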

These jobs require very little, if any, visualization—often because you’re just crunching the data and letting another system or manual job come and get it. There’s no need to test how easily Hadoop can be accessed by non-developers. There’s no connection with the rest of the business intelligence and reporting systems in the company. In other words, these projects are not a representative test of real-world success. They don’t use real-world data flows, and they usually don’t support a second and third use case on the same platform with the same skills.

To be clear, I am not saying that log analytics is not a valid use case. Nor am I arguing that it is not a good way to learn about Hadoop. What I am saying is this: Do not assume that initial success with log analytics using Hadoop will translate to enterprise success in a broader deployment. Do not confuse success with what is essentially an alternative way of doing an isolated, single-scope ETL job free from data quality and SLA requirements; it is not a good predictor of what will work in your typical enterprise production environments.

What do you think? Are log analytics a good starting point for working with big data, or are they a lousy way to get going? Let me know your thoughts in the comments.


Tom Deutsch

Tom Deutsch (Twitter: @thomasdeutsch) is chief technology officer (CTO) for the IBM Industry Solutions Group, and focuses on data science as a service. Tom played a formative role in the transition of Apache Hadoop–based technology from IBM Research to the IBM Software Group, and he continues to be involved with IBM Research's big data activities and the transition from research to commercial products. In addition, he created the IBM® InfoSphere® BigInsights™ Hadoop–based software, and he has spent several years helping customers with Hadoop, InfoSphere BigInsights, and InfoSphere Streams technologies by identifying architecture fit, developing business strategies, and managing early stage projects across more than 200 engagements. Tom came to IBM through the FileNet acquisition, where he had responsibility for FileNet’s flagship content management product and spearheaded FileNet product initiatives with other IBM software segments, including the Lotus and InfoSphere segments. Tom has also worked in Information Management in the CTO’s office and with a team focused on emerging technology. He helped customers adopt innovative IBM enterprise mash-ups and cloud-based offerings. With more than 20 years of experience in the industry, and as a veteran of two startups, Tom is an expert on the technical, strategic, and business information management issues facing the enterprise today. Most of his work has been on emerging technologies and business challenges, and he brings a strong focus on the cross-functional work required to have early stage projects succeed. Tom has coauthored a book on big data and multiple thought-leadership papers. He earned a bachelor’s degree from Fordham University in New York and an MBA degree from the University of Maryland University College.

  • Amir Rahnama

    If log analytics is not a good place to start, or as you said does not reflect the overall success of an enterprise, then why do big companies continue to use these analyses? Put another way, if it is not representative of overall success, is there any clue as to what it does represent?

  • Amir Rahnama

    “Most companies, however, experience a delay (measured not in hours or days, but in weeks) between when a Web event happens and when they know about it based on their click or Weblog behavior”. Could you please elaborate more on this? Is this delay related to the time-consuming process of log analysis?

  • Michael Manoochehri

    Hi Tom: I think your article dances around the issue: Hadoop is not necessarily the right tool for the job of analyzing massive amounts of log data. If a company is trying to take on the practical challenge of analyzing massive amounts of raw server logs, why would they not use a tool that is better suited for this task… i.e. Google BigQuery? (note: I work on the product team). When analyzing terabytes of log data, there is no reason why a query should result in a delay of hours or minutes, when tools exist that can return query results over log data in SECONDS.

    • Ellie Kesselman

      Michael: I just read your comment, smiled, clicked on your linked name, and smiled again. I agree, this is a slightly peculiar post that Tom wrote, though reading your comment has made me linger around longer than I would have otherwise.

      These are the salient points:
      ~ Hadoop can be used to process massive amounts of (boring) machine generated, relatively structured data. Despite being “raw” data from server logs, it is repetitive. This implies fixed field formats and records (I think…), similar to financial transactions data, or healthcare encounters
      ~ By starting with log data, one may avoid “custom”, and more costly, setups, e.g. Cloudera
      ~ Getting up to speed in Hadoop with log data is not portable knowledge for predictive analysis of unstructured data
      ~ Hadoop is not fast

      This leads me to wonder “Why not choose Google BigQuery if inclined to try something new?”
      Better yet… for risk-averse, budget-conscious types, stick with a nice z-system mainframe, tried and true, running JCL batch jobs overnight.

    • Ben Johnson

      I like what you guys have done with Google BigQuery. It’s an amazing piece of technology. However, many companies simply cannot (or will not) upload their internal data to a third party site like Google.

      Another issue is that BigQuery may not be flexible enough for analytics processing. It seems like it’s limited to SQL-style relational queries over columnar data whereas you might need to analyze user flow over multiple log events for a single user for risk analysis or a Google Analytics Flow-style visualization.

      [Full disclosure: I’m a developer on an open source, big data analytics server]