Big Data: Data Quality’s Best Friend? Part 1

Why we don't see an inherent trade-off between the volume of a data set and the quality of the data maintained within it


  • Diane

    I could not agree more!! It is better to have a couple of gigs of useful data than TBs of garbage. There are ways to cleanse large volumes of data, but the ideal is to stop the garbage at the source.


  • Jim Walker


    This is a well-done, realistic look at the intersection of big data and data quality. Much of the data that we see in our customers' Hadoop implementations is of this sort. Machine-generated data is much less likely to be of poor quality than data entered by the sales team in the field. Salespeople enter duplicate records? ;)

    In the end, I have always claimed that data quality is a sliding scale. Where you land on that scale is driven by two things: a) your stomach (budget) for dealing with the issues and, more importantly, b) the business problem you are trying to solve with the data. If I am just trying to find a needle in a haystack, how high does the quality need to be? If I am analyzing clickstream data and joining that with my customer data to gain insight into trends and patterns (or profitability of web stores), the required level of quality is completely different.

    It IS about different problems. Great post!