One question that keeps coming up in my conversations with customers, despite my best efforts to guide people away from it, is “how much data do I need to use a big data solution?” As I’ve written previously, data size is usually a lousy way to choose whether to use big data technologies. There are some cases where it makes sense to choose your technology based on data size (for example, if you know you are going to have 6+ PB of data under management, as some of our customers do), but most big data projects are driven by the need for flexibility well before scale comes into the picture.
The flexibility of these systems has many dimensions, but I’d like to focus on one of the most important ones here: the idea of “schema on read.” To see why it matters, start with its opposite, schema on write. Most of us are deeply familiar with schema on write, where we use a traditional (and still vital) relational database to store the data with a predetermined schema in mind: the structure of the data is determined before any data arrives, and the schema is applied at the time the data is written to the data store. We are so used to this that we generally accept it as the only way to do things. So to help change how we look at this, let’s quickly remind ourselves of the pros and cons of this approach.
There are some non-trivial benefits to schema on write, including:
However, schema on write isn’t the answer to every problem. Downsides of this approach include:
We’ve lived with these tradeoffs for a long time now, partly because there haven’t been many good alternatives. The emergence of big data technologies offers one, a schema on read approach, and that changes the equation: it gives us more flexibility to match the approach to the problem, its maturity, and the nature of the usage patterns we are serving.
Schema on read is dramatically simpler up front: you just write the information to the data store. Unlike schema on write, which requires you to expend time and effort before loading the data, schema on read involves very little delay, and you generally store the data at a raw or atomic level. In other words, you store what you get from the source systems, as it comes in from those systems. Schema on read means you can write your data first and then figure out how you want to organize it later.
So why do it that way? The key drivers: flexibility and reuse. With a schema on write approach, it is hard to support applications, reporting, and analytics that don’t understand your schema, need changes to it, or have ad hoc usage patterns. With a schema on read approach, you define the schema at the time of interaction so it can be (with some constraints) pretty much anything you want or need it to be.
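To make the idea concrete, here is a minimal sketch in Python of defining the schema at the time of interaction. Everything in it is illustrative rather than taken from any particular product: the sample records, the `read_with_schema` helper, and the two schema names are assumptions made up for this example. The raw records are stored untouched, and each consumer supplies its own field list and type conversions when it reads.

```python
import json

# Raw records kept exactly as they arrived (hypothetical sample data).
# Note that the second record carries an extra field the others lack.
RAW_STORE = [
    '{"id": 1, "customer": "acme", "amount": "19.99"}',
    '{"id": 2, "customer": "globex", "amount": "5.00", "coupon": "SPRING"}',
    '{"id": 3, "customer": "acme", "amount": "42.50"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema (field name -> converter) while reading, not while writing."""
    for line in raw_lines:
        record = json.loads(line)
        row = {}
        for field, convert in schema.items():
            value = record.get(field)
            # Fields missing from a raw record come back as None.
            row[field] = convert(value) if value is not None else None
        yield row


# Two consumers, two schemas, one copy of the raw data (names are hypothetical).
FINANCE_SCHEMA = {"customer": str, "amount": float}
MARKETING_SCHEMA = {"customer": str, "coupon": str}

finance_rows = list(read_with_schema(RAW_STORE, FINANCE_SCHEMA))
marketing_rows = list(read_with_schema(RAW_STORE, MARKETING_SCHEMA))

# The finance view can aggregate on a numeric amount even though the
# raw store only ever held strings.
total_by_customer = {}
for row in finance_rows:
    name = row["customer"]
    total_by_customer[name] = total_by_customer.get(name, 0.0) + row["amount"]
```

Neither consumer had to agree on a structure before the data was written, and adding a third view later would not require reloading anything. The cost is that every read pays for parsing and conversion, which is part of the tradeoff discussed here.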
You may want to consider taking a schema on read approach for several reasons:
But there are some drawbacks to schema on read too:
One area where we see the advantages far outweighing the drawbacks is in environments where multiple lines of business (LOBs) all try to hit the same source systems for their own copy of the data. The schema on read approach involves having a data “landing zone” where the raw or atomic data is written out. After the data has been pulled once, all the LOB systems make their schema on read requests against the landing zone. This spares the source systems from having to deal with every LOB request and provides a one-to-many approach to serving up data. We’ll talk about the landing zone pattern more in future columns.
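As a rough illustration of the one-to-many idea, the following Python sketch pulls from a source system once, writes the raw records to a landing file, and lets two LOB consumers read their own view from that file. The `SourceSystem` class, the `land` and `read_landing_zone` helpers, the file name, and the sample orders are all hypothetical, and a real landing zone would more likely sit on a distributed file system than in a temporary directory.

```python
import json
import tempfile
from pathlib import Path


class SourceSystem:
    """Hypothetical stand-in for an operational system; counts extract requests."""

    def __init__(self, records):
        self._records = records
        self.extract_count = 0

    def extract(self):
        self.extract_count += 1
        return list(self._records)


def land(source, landing_file):
    """Pull from the source once and write raw records, one JSON object per line."""
    with landing_file.open("w", encoding="utf-8") as out:
        for record in source.extract():
            out.write(json.dumps(record) + "\n")


def read_landing_zone(landing_file, fields):
    """Schema on read: each consumer projects only the fields it cares about."""
    with landing_file.open(encoding="utf-8") as src:
        for line in src:
            record = json.loads(line)
            yield {name: record.get(name) for name in fields}


source = SourceSystem([
    {"order_id": 1, "region": "east", "amount": 120.0, "sales_rep": "kim"},
    {"order_id": 2, "region": "west", "amount": 80.0, "sales_rep": "lee"},
])

# A temporary directory plays the role of the landing zone in this sketch.
landing_file = Path(tempfile.mkdtemp()) / "orders.jsonl"
land(source, landing_file)

# Two LOBs read from the landing zone; neither touches the source system.
finance_view = list(read_landing_zone(landing_file, ["order_id", "amount"]))
sales_view = list(read_landing_zone(landing_file, ["region", "sales_rep"]))
```

After both views are built, `source.extract_count` is still 1: the source system served a single request no matter how many consumers read from the landing zone.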
Remember, no one approach works for all needs. I’d encourage you to add this topic to your Fit For Purpose discussions. Let me know what you think in the comments, and thanks as always for reading.