
Why Log Analytics is a Great (and Awful) Place to Start with Big Data

The pros and cons of learning Hadoop using structured data


  • http://hippieitgeek.blogspot.com Amir Rahnama

    If log analytics is not a good place to start, or, as you said, does not reflect the overall success of an enterprise, then why do big companies continue to run those analyses? Put another way, if it does not represent overall success, is there any clue as to what it does represent?

  • http://hippieitgeek.blogspot.com Amir Rahnama

    “Most companies, however, experience a delay (measured not in hours or days, but in weeks) between when a Web event happens and when they know about it based on their click or Weblog behavior”. Could you please elaborate more on this? Is this delay related to the time-consuming process of log analysis?

  • https://developers.google.com/bigquery/ Michael Manoochehri

    Hi Tom: I think your article dances around the issue: Hadoop is not necessarily the right tool for the job of analyzing massive amounts of log data. If a company is trying to take on the practical challenge of analyzing massive amounts of raw server logs, why would they not use a tool that is better suited for this task… i.e. Google BigQuery? (note: I work on the product team). When analyzing terabytes of log data, there is no reason why a query should result in a delay of hours or minutes, when tools exist that can return query results over log data in SECONDS.

    • http://gooplex.wordpress.com/ Ellie Kesselman

      Michael: I just read your comment, smiled, clicked on your linked name, and smiled again. I agree, this is a slightly peculiar post that Tom wrote, though reading your comment has made me linger around longer than I would have otherwise.

      These are the salient points:
      ~ Hadoop can be used to process massive amounts of (boring) machine-generated, relatively structured data. Despite being “raw” data from server logs, it is repetitive. This implies fixed field formats and records (I think…), similar to financial transaction data or healthcare encounters
      ~ By starting with log data, one may avoid “custom”, and more costly, setups, e.g. Cloudera
      ~ Getting up to speed in Hadoop with log data is not portable knowledge for predictive analysis of unstructured data
      ~ Hadoop is not fast

      This leads me to wonder “Why not choose Google BigQuery if inclined to try something new?”
      OR
      Better yet… for risk-averse, budget-conscious types, stick with a nice z-system mainframe, tried and true, running JCL batch jobs overnight.
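
      The first point above — that raw server logs are repetitive enough to behave like fixed-format records — can be illustrated with a minimal map/reduce-style pass in plain Python. This is only a sketch of the idea, not Hadoop itself; the log lines and field positions are made up for illustration:

      ```python
      from collections import Counter

      # Hypothetical common-log-style lines: each record has fixed field
      # positions, which is what makes "raw" server logs relatively structured.
      logs = [
          '10.0.0.1 - - [10/Oct/2012:13:55:36] "GET /index.html HTTP/1.1" 200 2326',
          '10.0.0.2 - - [10/Oct/2012:13:55:39] "GET /missing.html HTTP/1.1" 404 209',
          '10.0.0.1 - - [10/Oct/2012:13:56:01] "POST /login HTTP/1.1" 200 512',
      ]

      # "Map" phase: emit (status_code, 1) for each line, relying on the
      # fixed position of the status code field.
      mapped = [(line.split()[-2], 1) for line in logs]

      # "Reduce" phase: sum counts per key -- the same shape of work a
      # Hadoop reducer performs, just on three lines instead of terabytes.
      counts = Counter()
      for status, n in mapped:
          counts[status] += n

      print(dict(counts))  # {'200': 2, '404': 1}
      ```

      The point is that the parsing step is trivial precisely because the records are repetitive; the same code would not transfer to genuinely unstructured data, which is the portability concern in the third bullet.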

    • http://skylandlabs.com Ben Johnson

      I like what you guys have done with Google BigQuery. It’s an amazing piece of technology. However, many companies simply cannot (or will not) upload their internal data to a third party site like Google.

      Another issue is that BigQuery may not be flexible enough for analytics processing. It seems like it’s limited to SQL-style relational queries over columnar data whereas you might need to analyze user flow over multiple log events for a single user for risk analysis or a Google Analytics Flow-style visualization.
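
      The kind of per-user flow analysis described above — stitching together multiple log events for one user into an ordered path — is awkward in purely relational SQL but straightforward in general-purpose code. A minimal sketch, with entirely hypothetical event tuples of (user_id, timestamp, page):

      ```python
      from collections import defaultdict

      # Hypothetical (user_id, timestamp, page) log events.
      events = [
          ("u1", 1, "/home"), ("u2", 2, "/home"),
          ("u1", 3, "/cart"), ("u1", 4, "/checkout"),
          ("u2", 5, "/search"),
      ]

      # Group events per user in time order to recover each user's click
      # path -- the "flow" that a flat SQL aggregate does not express well.
      flows = defaultdict(list)
      for user, ts, page in sorted(events, key=lambda e: e[1]):
          flows[user].append(page)

      print(dict(flows))
      # {'u1': ['/home', '/cart', '/checkout'], 'u2': ['/home', '/search']}
      ```

      Once events are grouped this way, risk scoring or a flow-style visualization operates on the per-user sequences rather than on row-level aggregates.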

      [Full disclosure: I'm a developer on an open source, big data analytics server]