Data Quality: You Are What You Eat
While debates about healthy eating and the diets that supposedly deliver it have raged for decades, interest has spiked over the last few years thanks to social media platforms such as Instagram. Wellness gurus and celebrity nutritionists exhort us to try this salad or that cleanse, instilling a mindset in which we constantly question our food choices. Instead of asking ourselves how a dish will taste, we obsess about its nutritional content. That vigilance is not always productive at the dinner table, but organisations would do well to cultivate it when it comes to data quality.
In his book Food Rules: An Eater’s Manual, journalist and author Michael Pollan proposes that eating well comes down to three rules — rules that also provide a useful guideline for organisations looking to improve data quality:
“Eat food. Not too much. Mostly plants.”
Eat Food
Pollan observes that a lot of the products that we consider food are not food in the most basic sense, e.g. Twinkies. As one of his rules states:
“Don’t eat anything your great‐grandmother wouldn’t recognize as food.”
This holds true for data as well. While technically any set of values can be considered data, some data is simply garbage. Examples include data that was collected in a faulty manner, has been corrupted, or serves no current or future purpose. Interestingly, truly random data does have value! Next time you are creating a data set, ask yourself: does this data have nutritional value? Will your algorithms be well fed… or is this data just empty calories?
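To make the "nutritional value" check a little more concrete, here is a minimal sketch of a label for a data set, written in plain Python. The function name `quality_report`, the field names, and the sample records are all hypothetical and purely illustrative. The sketch only counts missing values and exact duplicates, which is a small part of what a real data quality check would cover.

```python
from collections import Counter


def quality_report(records, required_fields):
    """Summarise basic quality problems in a list of dict records.

    Hypothetical helper: counts missing values per field and
    exact duplicate rows (judged on the required fields only).
    """
    missing = Counter()
    seen = set()
    duplicates = 0
    for record in records:
        for field in required_fields:
            value = record.get(field)
            # Treat None and empty strings as missing.
            if value is None or value == "":
                missing[field] += 1
        key = tuple(record.get(field) for field in required_fields)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return {
        "rows": len(records),
        "missing": dict(missing),
        "duplicates": duplicates,
    }


# Made-up sample records, for illustration only.
sample = [
    {"customer_id": 1, "country": "NL", "spend": 120.0},
    {"customer_id": 2, "country": "", "spend": 80.0},
    {"customer_id": 2, "country": "", "spend": 80.0},
    {"customer_id": 3, "country": "DE", "spend": None},
]

report = quality_report(sample, ["customer_id", "country", "spend"])
print(report)
```

Running a check like this before any analysis gives a quick read on whether the data set is food or filler.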
Not Too Much
In his book Pollan recites the popular adage that one should eat:
“Breakfast like a king, lunch like a prince, dinner like a pauper.”
Other than a solid breakfast being a great way to start a day filled with data science, this is also worth keeping in mind when it comes to your analyses. Given the pressure on many organisations to “leverage big data”, it can be tempting to collate massive data sets for every project. At worst, indiscriminately piling on data can lead to dangerous overfitting; at best, it wastes time and resources. Instead, be prudent when creating data sets, and realise that more data often does not lead to a more reliable output, so investing in data quality is the smarter choice.
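As a rough illustration of quality beating quantity, the sketch below compares a small, clean sample with a much larger one in which every fifth reading is a faulty zero. All values are synthetic and the scenario (a sensor with a true reading of 20.0) is an assumption made up for this example. It shows one way that more data can give a worse answer, not a general rule.

```python
from statistics import mean

# Hypothetical ground truth for a made-up temperature sensor.
TRUE_TEMPERATURE = 20.0

# A small, carefully collected sample (synthetic values).
clean_sample = [19.8, 20.1, 20.0, 20.2, 19.9]

# A much larger sample built from the same readings, except that
# every fifth entry is a faulty reading recorded as 0.0.
large_sample = [
    0.0 if i % 5 == 0 else clean_sample[i % 5]
    for i in range(1000)
]

clean_error = abs(mean(clean_sample) - TRUE_TEMPERATURE)
large_error = abs(mean(large_sample) - TRUE_TEMPERATURE)

print(f"clean sample: n={len(clean_sample)}, error={clean_error:.2f}")
print(f"large sample: n={len(large_sample)}, error={large_error:.2f}")
```

In this toy setup the five clean readings land almost exactly on the true value, while the thousand-row data set is off by roughly four degrees because of the faulty entries it swallowed along the way.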
Mostly Plants
Not only is eating less meat good for the planet, but our species evolved with a diet rich in plants, nuts, and grains. Because of this, Pollan thinks we should:
“Eat like an omnivore.”
Similarly, organisations should think carefully about the data they ingest to make sure they are drawing upon the right types of data. For example, given the ease of accessing social metrics, it can be tempting to skew data sets and analyses in favour of this data. Like juicy burgers, easily consumable data sources can get us hooked. However, discounting metrics from other channels can take an organisation away from its roots and damage its long-term prospects. Enjoy all modern data in moderation, alongside your existing metrics.
Although it is easily overlooked amid the day-to-day demands on analytics teams, being mindful about the data your organisation ingests yields significant benefits. With food, empty calories can make you feel sad and sluggish, and impair your brain’s ability to perform effectively. For organisations, improved data quality means being able to move faster, with greater confidence and better sustained performance. This is before we even mention the many ethical and regulatory benefits, but that discussion will have to wait for a future post!
— Ryan