Quality-Driven Data Manifesto

By Luhui Hu | November 2022

Every Data Engineer and Data Scientist should know

Photo by Alexandru Zdrobeau on Unsplash

Data is like ocean water, vast and necessary.

Is data important? Undoubtedly. Today, every company and business is data-driven.

Is all data important or valuable? Yes and no. As we know, “garbage in, garbage out” holds in Machine Learning and Computer Science. Data without quality can be useless to businesses yet costly to process and manage.

Now, “data is essential” becomes “data quality is essential.”

If you are familiar with Test-Driven Development (TDD), you will see why Quality-Driven Data (QDD) is imperative.

Why? How? What?

Why? Because defining data quality is notoriously difficult. So how do we define data quality? What is data quality? And how do we ensure quality-driven data as a whole?

There are many definitions of data quality. The most obvious assumption is that data quality simply means high quality, with one definition for all. But data quality does not equate to high quality, although high quality is a primary goal for data. Nor is it defined exactly the same everywhere, although no one can afford to ignore it anywhere. Therefore, I will make the following statement before proceeding:

Data quality is a state of data in the data lifecycle, which can be defined in degrees.

We can define data quality in detail along data domains, categories, lifecycle, and quality degrees.

How to Define Data Quality (by the author)

Where do we need data quality? If your answer is “everywhere,” you have either learned an unforgettable lesson about data or never used data seriously.

Data quality is domain-specific rather than generic or the same everywhere. It is sensitive and meaningful in six domains: business operations, data analytics, data governance, data management, data engineering, and data science.

Data Quality Domain (by author)

Business Operations

There are two main categories of business operations on data: OLTP and OLAP. Data for OLTP is usually stored in a relational or NoSQL database, while data for OLAP is usually stored in a data warehouse, data lake, or data lakehouse. Please see Modern Databases and the Future of Data for more information. Data quality requirements vary widely due to different business requirements and technologies. The difference may lie in the perspective and depth of business and operations.
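As a rough, hypothetical sketch (not from the original article) of the difference in workloads, the snippet below contrasts a small transactional (OLTP-style) write against an analytical (OLAP-style) aggregation; SQLite and pandas stand in for a production database and warehouse:

    import sqlite3
    import pandas as pd

    # OLTP-style workload (illustrative): small transactional writes to a relational store.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
    conn.execute("INSERT INTO orders VALUES (1, 'alice', 19.99)")
    conn.commit()

    # OLAP-style workload (illustrative): scanning and aggregating many records,
    # as a warehouse or lakehouse would do over columnar storage.
    orders = pd.read_sql("SELECT * FROM orders", conn)
    print(orders.groupby("customer")["amount"].sum())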

Data Analytics

Data analytics includes business intelligence, predictive analytics, and more. It enables businesses to make decisions with data. The accuracy, completeness, consistency, and timeliness of the data are important for making the right decisions. These are at the heart of data quality.

Data Engineering

Data engineering is the field and practice of building data systems for the data journey from creation to disposal. It is the foundation of the modern data stack. Data quality should be part of data engineering. Here, the pursuit of data quality goes beyond accuracy, completeness, consistency, and timeliness: it recognizes the importance of data observability, discoverability, and legibility.

Data Science

Data science is generally an interdisciplinary field to extract knowledge from large data sets and apply the knowledge and insights from that data to solve problems in a wide range of application domains. This is the core of machine learning. Data quality is the foundation of data science. Data lineage, semantics and statistics have become first-class citizens for data quality in data science.

Data Management

Data management is the discipline of managing data as a valuable resource. Data quality is an area of data management that mainly focuses on data usage. Its purpose is to control the quality of data in order to manage it well. On the other hand, the properties and features of data management affect data quality from the user's point of view.

Data Governance

Data governance is a principle of managing data as an asset throughout its lifecycle. It is a mechanism to ensure that data is secure, private, accurate, available, and usable. Data quality is part of data governance, but data governance extends data quality beyond technology (e.g., data engineering and data science) to organizations and non-technical people. It can balance priorities between data quality and compliance (including security and privacy), and it may have to compromise on data quality in favor of privacy and security. This may require defining accuracy and consistency in business language rather than technical terms.

The management of data as an asset can be extended to the general concept of data assets. This domain includes data sharing, trading, and exchange in the form of digital assets. For example, data minted as NFTs is a digital asset that can be used in the metaverse. Here, data quality looks at uniqueness, identity, and integrity.

Defining data quality depends on business and application requirements. In general, it is more business-driven than technology-driven. This means that quality is ultimately about business needs; otherwise, data quality loses its meaning.

Data quality will be expected and defined differently across different data categories, such as retail, manufacturing, logistics, medical, etc. Because of the uniqueness of each category, it can be further refined against data classification and compliance.

For example, DeepMind’s AlphaFold is an important milestone in protein structure prediction built on vast data. Amazon Retail also has massive data on operations and analytics. But the scope and expectation of data quality for the two differ in terms of timeliness, completeness, legibility, and more.

Does the expectation of data quality remain unchanged at all times throughout the data lifecycle?

Any data has a lifecycle from creation to collection, use, and disposal. The earlier in the lifecycle, the better for quality-driven data. However, the requirements and coverage of data quality should differ at different lifecycle stages. In most cases, data quality becomes important or meaningful during the usage phase. This does not mean we should ignore data quality from data creation onward, nor that we must maintain the same high quality throughout the lifecycle.

But we must understand the purpose of data quality and take it into account in engineering, so we can design for and comply with the requirements.

For example, sub-second timeliness matters when using data to make a scaling decision or to provide a responsive user experience on a retail website. But later, the same criterion may be unnecessary when the same data is merely stored.

Data quality can be defined in degrees. I prefer to express it in degrees rather than in metrics or dimensions because a degree can be qualitative or quantitative. Degrees are also more practical: each degree is measurable, degrees are countable and analyzable, they can be correlated with one another, and they accumulate into an overall picture. There are two levels of data quality degrees: Quality Fundamental and Quality Advanced degrees.
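As a minimal, hypothetical sketch of what “defined in degrees” could look like in practice (the class name and the simple 0–1 scoring are my assumptions, not part of the manifesto):

    from dataclasses import dataclass, field

    # Hypothetical illustration: each degree gets a 0-1 score, and scores
    # can be counted, compared, and accumulated across degrees.
    @dataclass
    class QualityReport:
        scores: dict = field(default_factory=dict)  # degree name -> score in [0, 1]

        def add(self, degree: str, score: float) -> None:
            self.scores[degree] = max(0.0, min(1.0, score))

        def overall(self) -> float:
            # Cumulative view over degrees (a simple mean; weighting is a design choice).
            return sum(self.scores.values()) / len(self.scores) if self.scores else 0.0

    report = QualityReport()
    report.add("completeness", 0.98)
    report.add("timeliness", 0.75)
    print(report.overall())  # 0.865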

Quality Fundamental Degrees

Data quality fundamental degrees are essential. There are four fundamental quality degrees: accuracy, completeness, consistency, and timeliness (a minimal sketch of such checks follows the list below).

  1. Accuracy: Precisely defined data, which includes data content (or value), precision, and metadata, although we often forget about precision and metadata.
  2. Consistency: Consistently defined data across contexts, pipelines, lineages, systems, and organizations.
  3. Completeness: No missing records or values, though this may be a different story for some massive sparse data in deep learning.
  4. Timeliness: Up-to-date data and timely responses from the service. This is becoming increasingly important for big data and machine learning.
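A minimal sketch of such checks with pandas, assuming an illustrative orders table; the column names, freshness window, and scoring are assumptions for illustration only:

    import pandas as pd

    # Illustrative dataset; "order_id", "amount", and "updated_at" are assumed column names.
    df = pd.DataFrame({
        "order_id": [1, 2, 2, 4],
        "amount": [10.0, None, 25.5, 7.2],
        "updated_at": pd.to_datetime(["2022-11-01", "2022-11-02", "2022-11-02", "2022-11-03"]),
    })

    # Completeness: share of non-missing values.
    completeness = 1.0 - df["amount"].isna().mean()

    # Accuracy (content/value check): amounts must be positive where present.
    accuracy = (df["amount"].dropna() > 0).mean()

    # Consistency: order_id should be defined the same way everywhere, e.g. unique per record.
    consistency = df["order_id"].nunique() / len(df)

    # Timeliness: share of records updated within one day of the newest record.
    freshness_window = pd.Timedelta(days=1)
    timeliness = (df["updated_at"] >= df["updated_at"].max() - freshness_window).mean()

    print(completeness, accuracy, consistency, timeliness)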

Quality Advanced Degrees

Data quality advanced degrees are also required but have emerged more recently or are better suited to specific domains. There are nine advanced quality degrees: Uniqueness, Validity, Relevance, Effectiveness, Observability, Discoverability, Governability, Semantics, and Integrity. The list can grow over time according to data domains and categories.

  1. Relevance: Data relevant to meeting business requirements.
  2. Effectiveness: Data effectiveness for data processing and machine learning, taking into account the volume, variety, and velocity of data.
  3. Semantics: Semantic information for data sets, columns, rows, and even records. This degree can enrich the semantics of data origin, lineage, and relationships.
  4. Uniqueness: Data uniqueness is a degree of overlap and of reduced duplication within a data set or against all records in the data set.
  5. Validity: Data validity is the degree to which data values conform to business rules. This does not equate to accuracy or completeness.
  6. Observability: The ability to observe data, related to visibility, monitoring, and debugging. Underlying data statistics and metrics should be part of it, and more and more modern data cloud solutions support this natively. For example, Delta Lake collects statistics for the Apache Parquet data files it writes (a pandas-based sketch of such statistics follows this list).
  7. Discoverability: The ability to integrate, share, and easily search for related data.
  8. Governability: The maturity degree of data governance, with emphasis on data compliance.
  9. Integrity: The degree of data integrity. The term may seem to overlap with the fundamental accuracy and consistency degrees, but it emphasizes protecting data from corruption throughout the data lifecycle.
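As a rough illustration of the observability degree, the sketch below computes the kind of per-column statistics (counts, null counts, min/max) that a system such as Delta Lake records for its Parquet data files at write time; this is plain pandas, not the Delta Lake API:

    import pandas as pd

    def column_stats(df: pd.DataFrame) -> dict:
        """Collect simple per-column statistics for observability and monitoring."""
        stats = {}
        for col in df.columns:
            series = df[col]
            entry = {
                "count": int(series.count()),           # non-null values
                "null_count": int(series.isna().sum()),
            }
            if pd.api.types.is_numeric_dtype(series):
                entry["min"] = series.min()
                entry["max"] = series.max()
            stats[col] = entry
        return stats

    df = pd.DataFrame({"price": [9.5, None, 12.0], "sku": ["a", "b", "c"]})
    print(column_stats(df))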

Quality-Driven Data (QDD) is a principle for data acquisition, use, and disposal. With this practice, we can define data quality in degrees, improve decision-making and ML quality, and prevent unexpected issues in advance. Data quality varies across data domains and data categories, and the two groups of quality degrees define data quality as a whole.

So data quality is sophisticated but actionable. It is sound to define the degrees in relation to the data lifecycle and categories. Data quality can then be effectively implemented and improved through quality-driven data practice or quality as infrastructure, through data governance, or through integration with the other data domains.
