Data testing: The neglected step

by Manvik Kathuria

Why does your data platform have trust issues?

Photo by CDC on Unsplash

Last week, I discussed test-driven development and its relevance in the agile world with my manager. We both come from a core software engineering background and shared insights from our past experiences, all while trying to address data quality in our data platform.

That’s when he mentioned that TDD would be perfect for data, as engineers know what to expect before ingesting the data. They can write tests to verify that the ingested data matches business, engineering and modeling expectations. I couldn’t agree more, and here I am, writing more about the state of data testing through a software engineering lens.

According to the Tech Jury Statistics blog, poor data quality cost the US economy about $3.1 trillion in 2020. It is expected that more enterprises will invest in a solid data collection foundation that will help streamline algorithmic output and improve employee efficiency.

[Tweet: image of a man smoking a cigar, captioned “I run a data-driven company based on low-quality data”]

The tweet aptly represents the current state of data testing and the low-quality data organizations are boasting about. Data testing is an important component of analytics and data science, yet it is often put on the back burner due to prioritization and a lack of skills and processes within data teams.

Not to say that every team is like this, but the sheer volume of millions of records and gigabytes of data scares engineers away from testing it. Software engineering has iterated over the years, with mature approaches being developed and refined from time to time. Data engineering, however, is still a relatively new discipline and has not reached the maturity required to deliver quality results.

Data pipelines can be complex, performing one or more ELT/ETL steps. Testing your output at each step will help build confidence in your data and identify regression issues in the long run. Unfortunately, most teams skip the critical phase of data validation and start building reports on top of modeled data. One of the best ways to prevent downstream issues is to run data quality checks before your data enters your data platform.

If that’s not an option, you can certainly run them as the first thing after ingesting your data. Do not attempt to perform these checks manually; instead, employ testing techniques to run automated tests that verify various aspects of your data.
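As a minimal sketch of what such an ingestion-time gate could look like, assuming a pandas-based pipeline and a purely illustrative “orders” feed (the column names and the validate_orders_batch helper are hypothetical, not from the original article):

```python
import pandas as pd

# Hypothetical ingestion-time quality gate for an illustrative "orders" feed.
REQUIRED_COLUMNS = {"order_id", "customer_id", "amount", "created_at"}

def validate_orders_batch(df: pd.DataFrame) -> list:
    """Return a list of violations; an empty list means the batch is safe to load."""
    errors = []
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        # Without the expected columns, value-level checks are meaningless.
        return [f"missing columns: {sorted(missing)}"]
    if df["order_id"].duplicated().any():
        errors.append("duplicate order_id values")
    if df[["order_id", "customer_id"]].isna().any().any():
        errors.append("null keys")
    if (df["amount"] < 0).any():
        errors.append("negative amounts")
    return errors

if __name__ == "__main__":
    batch = pd.DataFrame({
        "order_id": [1, 2, 2],
        "customer_id": [10, 11, 12],
        "amount": [99.0, -5.0, 20.0],
        "created_at": pd.to_datetime(["2022-10-01", "2022-10-02", "2022-10-02"]),
    })
    problems = validate_orders_batch(batch)
    if problems:
        print(f"Batch rejected or quarantined: {problems}")
    else:
        print("Batch accepted, loading into the platform...")
```

The idea is simply that a batch failing these checks is rejected or quarantined instead of silently flowing into downstream models.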

A data strategy seeks to create a sophisticated, modern data platform to democratize data within the organization. There are many ways to build scalable, modern data platforms, and most organizations have built them successfully. The bitter truth is that they still suffer from trust issues. An excellent data platform is useless if the data quality is poor; in contrast, high-quality data is far more helpful even in a plain Excel sheet. Don’t forget that the ultimate goal is to provide insights from the data to help your organization make informed decisions. The data platform is just an enabler, not a destination.

Data testing is not as straightforward as feature testing, and the complexity of executing it successfully makes it unpopular and neglected in the data engineering field. This should not prevent you from testing the data and related changes to ensure that it is of high enough quality to yield insights. Some challenges include the following:

Volume and Asymmetry

Testing gigabytes of data thoroughly is a challenge in itself, and with the amount of data being generated daily, the problem only gets magnified. Manually auditing data at this scale is slow and prone to error. Add to that the variety of data that needs to be tested, and the possibility of it changing over time, and the scope widens further.

Data availability

Testing real data in production is usually not allowed due to your organization’s data governance and security policies. This means that you are either testing copied, obfuscated data or creating test data in your non-production environment. Either way, there will be cases you have not covered that only surface in your production environment.

Understanding the data

While engineering teams do the data testing, they may not fully understand the intricacies of the data. It is possible to test the ingestion of raw data to some extent, but testing the changes built on top of it requires a thorough understanding of the data, the business rules, and its relationship to other data sets.

Lack of expertise

All organizations struggle to hire relevant talent, and with so many opportunities for people in software, it is challenging to employ specialists. Testing big data requires a different mindset than testing a feature. Creating and maintaining test cases requires coordination between technology teams, data stewards, data owners and other stakeholders, and then automating them to ensure coverage across the volume, variety, value and velocity of the data.

Data testing is a must and should be done before putting your data pipeline into the production environment. Similar to software testing, it helps identify anticipated issues in advance. The beauty of data lies in its variance and volume: engineers can create tests on accepted patterns, validating the type, value, distribution, quantity and expected variations of the data. Some techniques that can help you test your data and build trust with users are:

Automation

The days of manual testing are over, and the sooner you realize it, the better it is for you and your organization. Manually testing big data is impossible and not future-proof. Automation is the only way to continuously and rapidly test all the permutations in your data. There are a number of open-source tools/frameworks that you can take advantage of to automate the testing of your data, some of which pre-generate tests by profiling your data, with the option to tune them later.
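Open-source tools such as Great Expectations or dbt tests take this kind of approach. As a hand-rolled sketch of the same idea (the function names and the reference values below are hypothetical), expectations can be derived from a trusted sample and then applied automatically to every new batch:

```python
import pandas as pd

def profile_numeric_column(reference: pd.Series, tolerance: float = 0.1) -> dict:
    """Derive a simple expectation (min/max plus a tolerance band) from a trusted reference sample."""
    span = float(reference.max() - reference.min())
    return {
        "min": float(reference.min()) - tolerance * span,
        "max": float(reference.max()) + tolerance * span,
    }

def meets_expectation(new_batch: pd.Series, expectation: dict) -> bool:
    """True when every value in the new batch falls inside the profiled range."""
    return bool(new_batch.between(expectation["min"], expectation["max"]).all())

# "Learn" the expectation from historical, trusted data ...
reference = pd.Series([10.0, 12.5, 9.8, 11.2, 10.7])
expectation = profile_numeric_column(reference)

# ... then apply it automatically to every new batch.
print(meets_expectation(pd.Series([10.1, 11.9]), expectation))  # True
print(meets_expectation(pd.Series([10.1, 55.0]), expectation))  # False: out-of-range value
```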

Step-by-step testing

It is good practice to test the data in different parts of your pipeline, not only in the last layer. Adding tests from ingestion through transformation will help identify problems early. Data transformations can be very complex, and it is recommended to add data tests before and after your transformations. A good testing strategy should cover the ingested, curated/prepared, modeled and analytical layers of your data. The terms you use for these layers may differ, but the concept still holds.
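A minimal sketch of checks before and after a transformation, assuming a pandas-based pipeline with an illustrative per-customer aggregation (the transform and run_with_checks names are hypothetical):

```python
import pandas as pd

def transform(raw: pd.DataFrame) -> pd.DataFrame:
    """Illustrative transformation: total order amount per customer."""
    return raw.groupby("customer_id", as_index=False)["amount"].sum()

def run_with_checks(raw: pd.DataFrame) -> pd.DataFrame:
    # Tests before the transformation, on the ingested layer.
    assert not raw.empty, "pre-check failed: empty input batch"
    assert raw["amount"].notna().all(), "pre-check failed: null amounts in input"

    curated = transform(raw)

    # Tests after the transformation, on the curated layer.
    assert len(curated) <= len(raw), "post-check failed: aggregation added rows"
    assert abs(curated["amount"].sum() - raw["amount"].sum()) < 1e-6, \
        "post-check failed: total amount changed during transformation"
    return curated

raw = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [10.0, 5.0, 7.5]})
print(run_with_checks(raw))
```

Pairing checks around each step means a failure points at the exact transformation that introduced the problem rather than at the final report.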

Upskill

Although testing is an old concept in software engineering, it is relatively recent in the big data world. The industry is growing and learning as it experiments with new and better ways of producing high-quality data sets. The only way to trust your data is to make your team proficient in data testing. Continuous collaboration, brainstorming, brown bag sessions, and hands-on data testing exercises help create an environment of continuous learning.

TDD for data

I started with the concept of applying TDD to data. Honestly, I haven’t heard of anyone doing this, and I’m curious whether you are practicing the art of TDD for testing your data. When engineers write a pipeline for ingesting data, they most likely know and understand the data and the forms it can take.

If they can write the tests first, expecting that the ingested data will satisfy the test cases, it will make the data ingestion phase more reliable. Similarly, the expectations are already known when a user builds transformations on top of the ingested data. Having test cases in place before writing the transformations allows for faster testing of the curated data.
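A minimal sketch of what test-first data engineering could look like, assuming a pandas-based pipeline and pytest; the curate_orders transformation and its expected behaviour are hypothetical examples, not from the article:

```python
import pandas as pd

# Written *before* the pipeline code exists: the test captures what the
# curated output must look like, and fails until the transformation is built.
def test_orders_are_deduplicated_and_totalled():
    raw = pd.DataFrame({
        "order_id": [1, 1, 2],
        "customer_id": [10, 10, 11],
        "amount": [5.0, 5.0, 7.0],
    })

    curated = curate_orders(raw)

    assert curated["order_id"].is_unique
    assert curated["amount"].sum() == 12.0

# The transformation is then implemented until the test above passes.
def curate_orders(raw: pd.DataFrame) -> pd.DataFrame:
    return raw.drop_duplicates(subset="order_id")
```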

Data testing strategy

Although I mention it last, it is the first pillar of building trust in your data. Having a clear, well-documented data testing strategy will save you hours of rework and retesting in the long run. When you set a precedent that engineers must test before rolling out their pipelines to production, you define a process and implement a gatekeeper so that users never have to deal with garbage data.

Documenting the tools, frameworks and implementation details will clarify the organization’s stance on the subject and bring consistency across multiple data teams. Don’t forget to revisit your strategy from time to time and update it with feedback from engineers.

“In God we trust; all others must bring data.” This quote by W. Edwards Deming shows the importance of data to an organization.

I think the scale and magnitude of data have changed significantly, and it is time to tweak this quote a bit. From my point of view, it should read:

“In God we trust; all others must bring reliable data.”
