When it comes to big data, keep one thing in mind: not all of these criteria apply to it, and not all of them are 100% achievable.
The problem with consistency is that big data, by its very nature, tolerates a certain amount of "noise." Its sheer volume and varied structure make it difficult to remove every contradiction, and sometimes it's not even necessary. In some cases, however, there must be logical relationships within your big data. For example, when a bank's big data tool detects potential fraud (say, your card was used in Cambodia while you were in Arizona), it can monitor your social networks and check whether you are actually vacationing in Cambodia. In other words, it connects information about you from different data sets, which requires a certain level of consistency (a precise link between your bank account and your social network accounts).
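The cross-check described above can be sketched in a few lines. Everything here is illustrative: the record fields, the account linkage, and the three-day window are all assumptions made for the example, not how any real bank implements fraud detection.

```python
from datetime import datetime, timedelta

# Hypothetical, already-linked records from two separate data sets.
card_transaction = {"account_id": "acc-42", "country": "Cambodia",
                    "time": datetime(2024, 5, 10, 14, 30)}
social_checkin = {"account_id": "acc-42", "country": "Cambodia",
                  "time": datetime(2024, 5, 10, 9, 0)}

def looks_like_fraud(transaction, checkin, max_gap=timedelta(days=3)):
    """Flag the transaction unless a linked social post places the
    cardholder in the same country around the same time."""
    same_place = transaction["country"] == checkin["country"]
    close_in_time = abs(transaction["time"] - checkin["time"]) <= max_gap
    return not (same_place and close_in_time)

print(looks_like_fraud(card_transaction, social_checkin))  # → False (not fraud)
```

The point is the consistency requirement: this check only works because the two data sets share a reliable key (`account_id`) linking them.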
But while opinions about a particular product are being collected on social networks, duplications and contradictions are acceptable. Some people have multiple accounts and use them at different times: on one account they say they like the product, and on the other that they hate it. Why is that OK? Because, at a large scale, it doesn't affect the results of your data analysis.
As for accuracy, we mentioned earlier that the required level varies from task to task. Imagine a situation: you have to analyze the data from the past month, but two days' worth of data is missing. Without it, you cannot calculate exact figures. When it's something like the performance of a TV ad, that isn't critical: we can still calculate monthly averages and trends. But when the situation is more serious and demands precise calculations or thoroughly detailed historical records (as in the case of the heart monitor), inaccurate data can lead to wrong decisions and even more errors.
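To make the "two missing days" point concrete, here is a minimal sketch with made-up daily view counts. The figures are invented for illustration; the idea is simply that the monthly average is computed from the days that are present.

```python
# Hypothetical daily view counts for a 30-day month; two days are missing (None).
daily_views = [1200, 1150, None, None, 1300, 1250, 1100] + [1200] * 23

# Compute the average over the days we actually have.
observed = [v for v in daily_views if v is not None]
monthly_average = sum(observed) / len(observed)
print(monthly_average)  # → 1200.0 — still a usable monthly figure
```

With 28 of 30 days available, the average barely moves, which is exactly why a gap like this is tolerable for trend-level analytics but not for a heart monitor.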
Completeness is not something to worry about too much, because big data naturally has gaps, and that's OK. As with the two days of missing data, we can still get sufficient analysis results thanks to the large amount of other, similar data. The overall picture will look adequate even without the missing part.
As far as verifiability is concerned, big data offers several options. If you want to check the quality of your big data, you can, though it will cost your company time and resources: for example, to create scripts that check data quality and to run them, which can be expensive given the sheer volume of data involved.
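Such a quality-check script can be as simple as the sketch below. The record layout and field names are assumptions for the example; real checks would be tailored to your schema and would likely run on a distributed engine rather than in plain Python.

```python
def quality_report(records, required_fields=("user_id", "timestamp", "value")):
    """Return the share of records that pass basic validity checks.
    The required fields are hypothetical and chosen for illustration."""
    total = len(records)
    valid = sum(
        1 for r in records
        if all(r.get(f) is not None for f in required_fields)
    )
    return valid / total if total else 0.0

sample = [
    {"user_id": 1, "timestamp": "2024-05-01", "value": 10},
    {"user_id": 2, "timestamp": None, "value": 7},        # incomplete record
    {"user_id": 3, "timestamp": "2024-05-01", "value": 3},
]
print(quality_report(sample))  # 2 of 3 records valid ≈ 0.67
```

Running a check like this across billions of records is where the cost mentioned above comes from.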
And now for the criterion of order. You should be ready for a degree of "controllable chaos" in your data. Data lakes, for example, usually pay little attention to data structure and value: they simply store whatever they receive. Before data is loaded into a big data warehouse, however, a clean-up process is usually carried out, which can partially ensure that your data is correct. But only partially.
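A lake-to-warehouse clean-up step might look like the following sketch. The raw rows and their defects are invented for illustration, and the "partially" caveat shows up in the code: malformed rows are simply dropped rather than repaired.

```python
raw_lake_rows = [
    {"id": "1", "amount": " 19.99 ", "currency": "usd"},
    {"id": "2", "amount": "oops",    "currency": "USD"},   # malformed amount
    {"id": "1", "amount": " 19.99 ", "currency": "usd"},   # exact duplicate
]

def clean_for_warehouse(rows):
    """Normalize fields, drop malformed rows and exact duplicates."""
    seen, cleaned = set(), []
    for row in rows:
        try:
            amount = float(row["amount"].strip())
        except ValueError:
            continue  # partial cleansing: drop what we cannot fix
        key = (row["id"], amount, row["currency"].upper())
        if key in seen:
            continue  # skip duplicates
        seen.add(key)
        cleaned.append({"id": row["id"], "amount": amount,
                        "currency": row["currency"].upper()})
    return cleaned

print(clean_for_warehouse(raw_lake_rows))  # only one clean row survives
```

Note what this pipeline cannot do: it catches duplicates and unparseable values, but a row that is well-formed yet factually wrong sails straight through, which is why clean-up is only a partial guarantee.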
Will you stay "dirty" or go "clean"?
As you can see, none of these big data quality criteria is strict or universally applicable. And adapting your big data solution to fully meet all of these requirements:
• costs a lot
• needs a lot of time
• reduces the performance of your system
• is completely impossible
This is the reason why some companies neither chase clean data nor settle for dirty data. They use data that is "good enough". That means they set a minimum satisfactory threshold that provides them with sufficient analysis results, and then they make sure their data quality always stays above it.
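The "good enough" policy boils down to a simple gate. The 95% threshold below is a placeholder, not a recommendation; each company sets its own bar based on what still yields sufficient analysis results.

```python
GOOD_ENOUGH = 0.95  # hypothetical minimum share of valid records

def accept_batch(valid_records, total_records, threshold=GOOD_ENOUGH):
    """Accept a batch for analysis only if its quality clears the threshold."""
    quality = valid_records / total_records
    return quality >= threshold

print(accept_batch(970, 1000))  # → True: 97% valid is "good enough"
print(accept_batch(900, 1000))  # → False: 90% falls below the bar
```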
How can the quality of big data be improved?
There are three rules of thumb to follow when deciding on your big data quality policy and other data quality management procedures:
Rule 1: Be careful with data sources. You should have a hierarchy of data sources by reliability, because not all of them contain equally trustworthy information. Data from open or relatively unreliable sources should always be confirmed. Social networks are a good example of such a questionable data source:
• It can be impossible to track the time when a certain event happened on social media.
• You cannot say with certainty where the information comes from.
• Or it can be difficult for algorithms to recognize the emotions conveyed in user posts.
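A source hierarchy like the one Rule 1 describes can be encoded as reliability scores. The source names, scores, and threshold below are all assumptions made for the sketch.

```python
# Hypothetical reliability scores forming a source hierarchy.
SOURCE_TRUST = {
    "internal_db": 0.95,
    "partner_feed": 0.80,
    "social_network": 0.40,
}
CONFIRMATION_THRESHOLD = 0.60  # assumed cut-off: below this, confirm the data

def needs_confirmation(source):
    """Unknown sources get a trust score of 0 and always need confirmation."""
    return SOURCE_TRUST.get(source, 0.0) < CONFIRMATION_THRESHOLD

print(needs_confirmation("social_network"))  # → True
print(needs_confirmation("internal_db"))     # → False
```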
Rule 2: Organize proper storage and transformation. Your data lakes and data warehouses need to be maintained if you want to achieve good data quality. And a fairly "powerful" data cleansing tool must be in place while your data is being transferred from a data lake to a big data warehouse. Also, at this point your data should be matched against all other required records in order to achieve some level of consistency (if it is necessary at all).
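The record matching mentioned in Rule 2 can be sketched as a cross-dataset consistency check. The two data sets, their keys, and the "email" field are hypothetical; in practice this kind of reconciliation would run inside your transformation pipeline.

```python
# Two hypothetical data sets that should agree on a user's e-mail address.
crm_records = {"u1": {"email": "ann@example.com"},
               "u2": {"email": "bob@example.com"}}
warehouse_rows = {"u1": {"email": "ann@example.com"},
                  "u2": {"email": "robert@example.com"}}

def inconsistent_keys(a, b, field="email"):
    """Return keys present in both data sets whose field values disagree."""
    return [k for k in a.keys() & b.keys()
            if a[k].get(field) != b[k].get(field)]

print(inconsistent_keys(crm_records, warehouse_rows))  # → ['u2']
```

Records flagged this way can then be corrected, confirmed against a more reliable source, or excluded, depending on how much consistency the task actually requires.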
Rule 3: Conduct regular audits. We have already covered this, but the issue deserves special attention. Data quality audits, like audits of your big data solution as a whole, are an integral part of the maintenance process. You may need both manual and automatic audits: for example, you can analyze your recurring data quality problems and write scripts that run regularly and check for them.