Why is Schema on Read So Useful?
One question that keeps coming up in my conversations with customers, despite my best efforts to guide people away from it, is “how much data do I need to use a big data solution?” As I’ve written previously, data sizing is usually a lousy way to choose whether to use big data technologies. There are some cases where it makes sense to choose your technology based on data size (for example, if you know you are going to have 6+ PB of data under management, as some of our customers do), but most big data projects are driven by the need for flexibility well before scale comes into the picture.
The flexibility of these systems has many dimensions, but I’d like to focus on one of the most important ones here: the idea of “schema on read.” Most of us are deeply familiar with its counterpart, schema on write: we use a traditional (and still vital) relational database, the structure of the data is determined before any data arrives, and we apply the schema to the data store at the time the data is written. We know it so well that we generally accept it as the only way to do things. So to help change how we look at this, let’s quickly remind ourselves of the pros and cons of the schema on write approach.
There are some non-trivial benefits to schema on write, including:
- In traditional data ecosystems, most tools (and people) expect schemas and can get right to work once the schema is described
- The approach is extremely useful in expressing relationships between data points
- It can be a very efficient way to store “dense” data
However, schema on write isn’t the answer to every problem. Downsides of this approach include:
- Schemas are typically purpose-built and hard to change
- Generally loses the raw/atomic data as a source
- Requires considerable modeling/implementation effort before being able to work with the data
- If a certain type of data doesn’t fit the schema, you can’t effectively store or use it (if you can store it at all)
- Unstructured and semi-structured data sources tend not to be a native fit
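To make those tradeoffs concrete, here is a minimal sketch of schema on write using Python's built-in sqlite3 module. The `orders` table, its columns, the sample records, and the `load_with_schema_on_write` helper are all hypothetical, made up for illustration. The point is only that the structure is fixed before any data arrives, so a record carrying a field the schema never anticipated has nowhere to go.

```python
import sqlite3

# Hypothetical schema, decided before any data arrives.
EXPECTED_FIELDS = {"order_id", "customer", "amount"}


def load_with_schema_on_write(records):
    """Insert records into a table whose columns were fixed up front.

    Returns (stored_rows, rejected_records). Records whose fields do not
    match the predefined schema are rejected rather than stored.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        "order_id INTEGER PRIMARY KEY, "
        "customer TEXT NOT NULL, "
        "amount REAL NOT NULL)"
    )
    rejected = []
    for rec in records:
        if set(rec) != EXPECTED_FIELDS:
            # The schema has no place for this shape of data.
            rejected.append(rec)
            continue
        conn.execute(
            "INSERT INTO orders VALUES (:order_id, :customer, :amount)", rec
        )
    stored = conn.execute(
        "SELECT order_id, customer, amount FROM orders ORDER BY order_id"
    ).fetchall()
    conn.close()
    return stored, rejected


records = [
    {"order_id": 1, "customer": "acme", "amount": 19.99},
    # An extra "coupon" field that the schema did not anticipate.
    {"order_id": 2, "customer": "globex", "amount": 5.0, "coupon": "SPRING"},
]
stored, rejected = load_with_schema_on_write(records)
print("stored:", stored)
print("rejected:", rejected)
```

A real loader might silently drop the extra field instead of rejecting the record, but either way the original record is no longer available downstream unless someone changes the schema first.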
We’ve lived with these tradeoffs for a long time now, partly because there weren’t many good alternatives. The emergence of big data technologies offers one, a schema on read approach, which changes the equation by giving us more flexibility to match the approach to the problem, maturity, and nature of the patterns we are serving.
Schema on read is dramatically simpler up front: you just write the information to the data store. Unlike schema on write, which requires you to expend time and effort before loading the data, schema on read involves very little delay and you generally store the data at a raw or atomic level. In other words, you store what you get from the source systems, exactly as it comes in from those systems. Schema on read means you can write your data first and then figure out how you want to organize it later.
So why do it that way? The key drivers: flexibility and reuse. With a schema on write approach, it is hard to support applications, reporting, and analytics that don’t understand your schema, need changes to it, or have ad hoc usage patterns. With a schema on read approach, you define the schema at the time of interaction so it can be (with some constraints) pretty much anything you want or need it to be.
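Here is a minimal sketch of that idea in plain Python, under some loud assumptions: the raw store is just a list of JSON strings, and the `read_with_schema` helper, the field names, and the two example schemas are hypothetical, not taken from any particular big data engine. The raw records are written as they arrive, and each consumer supplies its own schema (field name, type converter, default) at the moment it reads.

```python
import json

# Raw records stored exactly as they arrived; note that the shapes differ.
RAW_STORE = [
    '{"order_id": 1, "customer": "acme", "amount": "19.99", "coupon": "SPRING"}',
    '{"order_id": 2, "customer": "globex", "amount": "5.00"}',
    '{"order_id": 3, "customer": "acme", "amount": "42.50", "channel": "web"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema at read time.

    `schema` maps a field name to a (converter, default) pair. Fields that
    are missing from a raw record get the default; fields not named in the
    schema are simply ignored for this particular read.
    """
    rows = []
    for line in raw_lines:
        rec = json.loads(line)
        row = {}
        for field, (convert, default) in schema.items():
            row[field] = convert(rec[field]) if field in rec else default
        rows.append(row)
    return rows


# Two consumers, two schemas, one copy of the raw data.
billing_schema = {"order_id": (int, None), "amount": (float, 0.0)}
marketing_schema = {
    "customer": (str, None),
    "coupon": (str, None),
    "channel": (str, "unknown"),
}

billing = read_with_schema(RAW_STORE, billing_schema)
marketing = read_with_schema(RAW_STORE, marketing_schema)
print(billing)
print(marketing)
```

Nothing about the stored data had to change to serve the second consumer, and a third schema could be added years later against the same raw records. The cost shows up at read time, since every read parses and converts the data again.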
You may want to consider taking a schema on read approach for several reasons:
- Gives you massive flexibility over how the data can be consumed
- Your raw/atomic data can be stored for reference and consumption years into the future
- The approach promotes experimentation, since the cost of getting it “wrong” is so low
- Helps speed the time from data generation to availability
- Gives you flexibility to store unstructured, semi-structured, and/or loosely organized or unorganized data
But there are some drawbacks to schema on read too:
- Can be “expensive” in terms of compute resources (then again, these big data engines were built to handle that)
- The data is not self-documenting (i.e., you can’t look at a schema to figure out what the data is)
- You have to spend time building the jobs that apply the schema on read
One area where we see the advantages far outweighing the drawbacks is in environments where multiple lines of business (LOBs) all try to hit the same source systems for their own copy of the data. The schema on read approach involves having a data “landing zone” where the raw or atomic data is written out. After the data has been pulled from the source once, all the LOB systems make their schema on read requests against the landing zone. This prevents the source systems from having to deal with all the LOB requests and provides a one-to-many approach of serving up data. We’ll talk about the landing zone pattern more in future columns.
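As a rough sketch of the landing zone idea, assume a temporary directory stands in for the landing zone and a small `SourceSystem` class stands in for an operational system. Both are hypothetical, as are the `land` and `read_landing` helpers, the file name, and the field names. The source is extracted once, the records are written out unchanged, and each LOB reads its own view from the landed file.

```python
import json
import tempfile
from pathlib import Path


class SourceSystem:
    """Hypothetical stand-in for an operational system; counts extracts."""

    def __init__(self, records):
        self._records = records
        self.extract_calls = 0

    def extract(self):
        self.extract_calls += 1
        return list(self._records)


def land(source, landing_dir):
    """Pull from the source once and write records unchanged, one per line."""
    path = Path(landing_dir) / "orders.jsonl"
    with path.open("w", encoding="utf-8") as f:
        for rec in source.extract():
            f.write(json.dumps(rec) + "\n")
    return path


def read_landing(path, fields):
    """Project each landed record onto the fields one LOB cares about."""
    with path.open(encoding="utf-8") as f:
        return [{k: rec.get(k) for k in fields} for rec in map(json.loads, f)]


source = SourceSystem([
    {"order_id": 1, "customer": "acme", "amount": 19.99, "region": "east"},
    {"order_id": 2, "customer": "globex", "amount": 5.0, "region": "west"},
])

with tempfile.TemporaryDirectory() as landing_dir:
    landed = land(source, landing_dir)
    # Each LOB reads from the landing zone, never from the source system.
    finance_view = read_landing(landed, ["order_id", "amount"])
    sales_view = read_landing(landed, ["customer", "region"])

print("source extracts:", source.extract_calls)
print(finance_view)
print(sales_view)
```

However many LOB views are added, the extract counter on the source stays at one, which is the one-to-many property described above.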
Remember, no one approach works for all needs. I’d encourage you to add this topic to your Fit For Purpose discussions. Let me know what you think in the comments, and thanks as always for reading.