Why is Schema on Read So Useful?

A primer on why flexibility—not scale—often drives big data adoption

One question that keeps coming up in my conversations with customers, despite my best efforts to steer people away from it, is “how much data do I need to use a big data solution?” As I’ve written previously, data sizing is usually a lousy way to decide whether to use big data technologies. There are some cases where it makes sense to choose your technology based on data size (for example, if you know you are going to have 6+ PB of data under management, as some of our customers do), but most big data projects are driven by the need for flexibility well before scale comes into the picture.

The flexibility of these systems has many dimensions, but I’d like to focus on one of the most important ones here: the idea of “schema on read.” Most of us are deeply familiar with the opposite approach, schema on write: the structure of the data is definitively determined before any data arrives, and we apply that predetermined schema to a traditional (and still vital) relational database at the time the data is written. We are so used to this model that we generally accept it as the only way to do things. So to help change how we look at this, let’s quickly remind ourselves of the pros and cons of schema on write.

There are some non-trivial benefits to schema on write, including:

  • In traditional data ecosystems, most tools (and people) expect schemas and can get right to work once the schema is described
  • The approach is extremely useful in expressing relationships between data points
  • It can be a very efficient way to store “dense” data

However, schema on write isn’t the answer to every problem. Downsides of this approach include:

  • Schemas are typically purpose-built and hard to change
  • You generally lose the raw/atomic data as a source
  • Requires considerable modeling/implementation effort before you can work with the data
  • If a certain type of data doesn’t fit the schema, you can’t use it effectively, assuming you can store it at all (the sketch after this list shows that failure mode)
  • Unstructured and semi-structured data sources tend not to be a native fit

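To make these tradeoffs concrete, here is a minimal sketch of schema on write in Python, using the standard library’s sqlite3 module; the article isn’t tied to any particular stack, so the table, columns, and records here are hypothetical. The structure is fixed before any data arrives, and a record that doesn’t fit it fails the load.

```python
import sqlite3

# Schema on write: the structure is fixed before any data arrives.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer TEXT NOT NULL,
        amount   REAL NOT NULL
    )
""")

# Data that matches the schema loads cleanly.
conn.execute("INSERT INTO orders VALUES (1, 'acme', 99.50)")

# Data that doesn't fit is a problem: this record carries an extra field
# the table was never designed to hold, so we must drop the field, rework
# the schema, or fail the load.
incoming = (2, "globex", 12.00, '["/home", "/cart"]')  # no column for the 4th value
try:
    conn.execute("INSERT INTO orders VALUES (?, ?, ?, ?)", incoming)
except sqlite3.OperationalError as err:
    print("rejected:", err)  # table orders has 3 columns but 4 values were supplied
```
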
We’ve lived with these tradeoffs for a long time now, partly because there haven’t been many good alternatives. The emergence of big data technologies offers an alternative, a schema on read approach, that changes the equation: it gives us far more flexibility in matching the approach to the problem, the maturity of the effort, and the nature of the patterns we are serving.

Schema on read is dramatically simpler up front: you just write the information to the data store. Unlike schema on write, which requires time and effort before you can load the data, schema on read involves very little delay, and you generally store the data at a raw or atomic level. In other words, you store what you get from the source systems, as it comes in from those systems. Schema on read means you can write your data first and figure out how you want to organize it later.
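
A minimal sketch of that write path, in Python with hypothetical file and field names (the article isn’t tied to any particular engine): ingest does no modeling at all, and whatever a source emits is appended as-is.

```python
import json

# Schema on read, write side: no modeling, no DDL. Records of different
# shapes from different sources all land in the same raw store.
raw_events = [
    {"user": "alice", "action": "login", "ts": "2014-06-01T09:00:00"},
    {"user": "bob", "action": "purchase", "sku": "X-42", "amount": 19.99},
    {"device": "sensor-7", "reading": 72.4},  # a different shape entirely: fine
]

with open("landing_zone.jsonl", "a") as store:
    for event in raw_events:
        store.write(json.dumps(event) + "\n")
```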

So why do it that way? The key drivers: flexibility and reuse. With a schema on write approach, it is hard to support applications, reporting, and analytics that don’t understand your schema, need changes to it, or have ad hoc usage patterns. With a schema on read approach, you define the schema at the time of interaction so it can be (with some constraints) pretty much anything you want or need it to be.
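
Continuing that hypothetical sketch, defining the schema at the time of interaction might look like this: each consumer projects the same raw records into whatever shape it needs, and no consumer’s schema constrains any other’s.

```python
import json

def read_with_schema(path, fields):
    """Apply a schema at read time: keep only the records that carry the
    requested fields, projected into the shape this consumer wants."""
    with open(path) as store:
        for line in store:
            record = json.loads(line)
            if all(f in record for f in fields):
                yield {f: record[f] for f in fields}

# Two consumers, two schemas, one copy of the raw data.
logins  = list(read_with_schema("landing_zone.jsonl", ["user", "action"]))
sensors = list(read_with_schema("landing_zone.jsonl", ["device", "reading"]))
```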

You may want to consider taking a schema on read approach for several reasons:

  • Gives you massive flexibility over how the data can be consumed
  • Your raw/atomic data can be stored for reference and consumption years into the future
  • The approach promotes experimentation, since the cost of getting it “wrong” is so low
  • Helps speed the time from data generation to availability
  • Gives you flexibility to store unstructured, semi-structured, and loosely organized (or entirely unorganized) data

But there are some drawbacks to schema on read too:

  • Can be “expensive” in terms of compute resources (then again, these big data engines were built to handle that)
  • The data is not self-documenting (i.e., you can’t look at a schema to figure out what the data is)
  • You have to spend time building the jobs that apply the schema at read time

One area where we see the advantages far outweighing the drawbacks is in environments where multiple lines of business (LOBs) all hit the same source systems for their own copy of the data. The schema on read approach involves having a data “landing zone” where the raw or atomic data is written out. After the data is captured once, all the LOB systems make their schema on read requests against the landing zone. This prevents the source systems from having to field every LOB request and provides a one-to-many approach to serving up data. We’ll talk about the landing zone pattern more in future columns.
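
Here is a minimal sketch of that pattern under the same hypothetical setup: the source system is extracted exactly once, and each LOB applies its own schema on read against the landing zone copy rather than the source.

```python
import json

SOURCE_PULLS = 0

def pull_from_source():
    """Simulated extract from the operational source system."""
    global SOURCE_PULLS
    SOURCE_PULLS += 1
    return [{"order_id": 1, "region": "EMEA", "amount": 250.0},
            {"order_id": 2, "region": "APAC", "amount": 80.0}]

# Land the raw data once...
with open("landing_zone.jsonl", "w") as store:
    for record in pull_from_source():
        store.write(json.dumps(record) + "\n")

# ...and let every LOB read from the copy, never from the source.
def lob_view(fields):
    with open("landing_zone.jsonl") as store:
        return [{f: json.loads(line)[f] for f in fields} for line in store]

finance = lob_view(["order_id", "amount"])
sales   = lob_view(["region", "amount"])
assert SOURCE_PULLS == 1  # the source system was touched exactly once
```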

Remember, no one approach works for all needs. I’d encourage you to add this topic to your Fit For Purpose discussions. Let me know what you think in the comments, and thanks as always for reading.

 

Tom Deutsch

Tom Deutsch (Twitter: @thomasdeutsch) is chief technology officer (CTO) for the IBM Industry Solutions Group, and focuses on data science as a service. Tom played a formative role in the transition of Apache Hadoop–based technology from IBM Research to the IBM Software Group, and he continues to be involved with IBM Research's big data activities and the transition from research to commercial products. In addition, he created the IBM® InfoSphere® BigInsights™ Hadoop–based software, and he has spent several years helping customers with Hadoop, InfoSphere BigInsights, and InfoSphere Streams technologies by identifying architecture fit, developing business strategies, and managing early stage projects across more than 200 engagements. Tom came to IBM through the FileNet acquisition, where he had responsibility for FileNet’s flagship content management product and spearheaded FileNet product initiatives with other IBM software segments, including the Lotus and InfoSphere segments. Tom has also worked in the Information Management CTO’s office and with a team focused on emerging technology. He helped customers adopt innovative IBM enterprise mash-ups and cloud-based offerings. With more than 20 years of experience in the industry, and as a veteran of two startups, Tom is an expert on the technical, strategic, and business information management issues facing the enterprise today. Most of his work has been on emerging technologies and business challenges, and he brings a strong focus on the cross-functional work required to have early stage projects succeed. Tom has coauthored a book on big data and multiple thought-leadership papers. He earned a bachelor’s degree from Fordham University in New York and an MBA degree from the University of Maryland University College.

  • Mark

    Like the comment around promoting experimentation, as it is so easy to ‘schema on read.’

  • Tim J Brown (http://twitter.com/TimBrown_IBM)

    Thanks Tom, excellent article. This is a critical discussion point and you’ve pulled it together in such a way that will help me in my future discussions.