Imagine you have achieved the ideal scenario: everyone is on board to start using data better, to crunch the numbers and make more informed decisions. With lots of excitement, you now look at where your data are “stored”.
There is a high chance that:
There are many, many files whose names will never tell you what they contain, for instance “Data 101022.csv”, etc.
They are stored in various folders whose names, again, are not indicative of what is inside.
As you open up the files, some of the CSV files, if you are lucky, contain column names, but (as you guessed) the names may not be indicative of the values stored under them.
Given how haphazard it all is, your trust in the collected data starts to drop. Mind you, I haven’t even gotten to the nightmare scenario where data quality is atrocious in the columns you have never analysed.
I am highly confident that such a scenario will happen! How so?
If you have benefited from the newsletter and would like to support my cause, consider making a “book” donation.
Link at the bottom of the newsletter or here. :)
The biggest reason such a scenario happens is that data collection needs lots of planning, from naming conventions to deciding how documentation is done and stored. Most companies, just before they bang the table and say they want to use their data better, are in “firefighting” or “survival” mode, and using data for better decision making or automation is an afterthought. As such, data are just collected and stored haphazardly.
To collect good quality data and make it possible to automate data cleaning scripts, planning how that data will be collected is essential.
Companies need to look at the following areas to collect good quality data:
Naming Conventions - Columns, Datasets, Folders, Images & Documents (Unstructured); see the sketch after this list.
Metadata - Definitions, Calculation & Formulas, Decimal Places, Sort Order
Data Quality - How to increase the odds of collecting good quality data, e.g. online forms vs written forms.
And many more…
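As a minimal illustration of how the first two areas pay off, here is a sketch in Python. The data dictionary, column names, and file name are hypothetical, not from any real project; the point is that once naming conventions and metadata are agreed up front, a cleaning script can validate incoming files automatically instead of guessing what “Data 101022.csv” contains.

```python
import csv
import re
from pathlib import Path

# Hypothetical data dictionary: the metadata agreed on during planning.
# Each column records a definition and expected type, and column names
# follow an agreed snake_case convention.
DATA_DICTIONARY = {
    "order_id":   {"definition": "Unique order identifier", "type": int},
    "order_date": {"definition": "Date order was placed (YYYY-MM-DD)", "type": str},
    "amount_sgd": {"definition": "Order value in SGD, 2 decimal places", "type": float},
}

SNAKE_CASE = re.compile(r"^[a-z][a-z0-9_]*$")

def validate_csv(path: Path) -> list[str]:
    """Return a list of issues found in the file's header row."""
    issues = []
    with path.open(newline="") as f:
        header = next(csv.reader(f), [])
    for col in header:
        if not SNAKE_CASE.match(col):
            issues.append(f"{col!r} does not follow the snake_case naming convention")
        if col not in DATA_DICTIONARY:
            issues.append(f"{col!r} is not documented in the data dictionary")
    for col in DATA_DICTIONARY:
        if col not in header:
            issues.append(f"expected column {col!r} is missing")
    return issues

# Example usage (file name is hypothetical):
# for problem in validate_csv(Path("sales_orders_2022-10-10.csv")):
#     print(problem)
```

A check like this only works because the naming convention and data dictionary were decided before collection started, which is exactly the planning this post is arguing for.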
Not forgetting finding suitable tools to establish data management and governance processes. The planning and execution tasks can be daunting.
So what can companies do about it? My advice is the same as always: get someone more experienced to help. Seek that person’s opinion and feedback diligently. Document all possible details in a manner that is easily retrievable.
Planning is essential for data to be of good quality, and it takes time, so START NOW if your company wants to move beyond survival mode and make more informed decisions through data.
What are your thoughts? Share them in the comments below! Do give a “Like” if you found this informative! :)
It’s been my entire adult life. Glad to help.
I like this. I like anything about data management.
Quality, to me, means fit for use. In that way, there is no such thing as bad data, as all data can be useful for understanding something.
But not all data is fit for use for a given piece of automation. It takes data enrichment to get to that point. Think data refinery.
I work on language data, and networks, primarily. There is no bad language data. It can all be useful, depending on your level of skill and creativity.