The heart of the data mesh beats in real time with Apache Kafka

If there were a buzzword of the hour, it would certainly be “data mesh”! This new architectural paradigm unlocks analytical and transactional data at scale and enables rapid access to an ever-growing number of distributed domain datasets for various usage scenarios. The data mesh addresses the most common weaknesses of the traditional centralized data lake or data platform architecture. And the heart of a decentralized data mesh infrastructure should be real-time, reliable, and scalable. Learn how Apache Kafka, the de facto standard for data streaming, plays a vital role in building data meshes.

There is no single technology or product for Data Mesh!

This post explores how Apache Kafka, as an open and scalable decentralized real-time platform, can be the basis of a data mesh infrastructure and, complemented by many other data platforms such as data warehouses, data lakes, and lakehouses, solves real business problems.

There is no silver bullet or single technology/product/cloud service for implementing a data mesh. The main outcome of a data mesh architecture is the ability to build data products with the right tool for the job. A good data mesh combines data streaming technology such as Apache Kafka or Confluent Cloud with a cloud-native data warehouse and data lake architecture from Snowflake, Databricks, Google BigQuery, et al.

What is Data Mesh?

I will not write yet another article describing the concepts of the data mesh. Zhamak Dehghani coined the term in 2019. The following 30,000-foot view of the data mesh architecture explains the basic idea well:

I summarize the data mesh in the following four bullet points:

  • An architectural paradigm with many historical influences (domain-driven design, microservices, data marts, data streaming)
  • Not specific to a single technology or product; no single vendor can implement a data mesh alone
  • Handling data as a product is a fundamental change, enabling a more flexible architecture and independent solutions to individual business problems
  • Decentralized services, not only for analytics but also for transactional workloads

Why handle data as a product?

Talking about innovative technology is not enough to introduce a new architectural paradigm; it is also important to measure the business value of the enterprise architecture.

McKinsey found that “when companies manage data like a consumer product — be it digital or physical — they can realize near-term value from their data investments and pave the way for getting more value tomorrow. Creating reusable data products and patterns for piecing together data technologies enables companies to derive value from data today and tomorrow.”

McKinsey - Why Handle Data as a Product

For McKinsey, the benefits of this approach could be significant:

  • New business use cases can be delivered up to 90 percent faster
  • The total cost of ownership, including technology, development, and maintenance, can drop by 30 percent
  • The risk and burden of data governance can be reduced

What is data streaming with Apache Kafka and its relation to Data Mesh?

A data mesh enables decentralization and flexibility through best-of-breed data products. Data sharing requires reliable real-time data at any scale between data producers and data consumers. Additionally, correct decoupling between decentralized data products is the key to the success of the data mesh paradigm. Each domain should have access to shared data but also the ability to choose the right tool (i.e., technology, API, product, or SaaS) to solve its business problems.
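To make the decoupling idea tangible, here is a minimal, purely illustrative in-memory sketch (not Kafka itself; all names such as `EventLog` and the "orders" domain are hypothetical) of the pattern Kafka provides: producers append events to a log, and each consumer tracks its own offset, so domains consume the same data product independently and at their own pace.

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class EventLog:
    """Toy append-only log mimicking a Kafka topic (illustration only)."""
    events: list = field(default_factory=list)

    def produce(self, event: Any) -> None:
        """Append an event to the end of the log."""
        self.events.append(event)

    def poll(self, offset: int):
        """Return all events at or after `offset`, plus the new offset.
        Consumers own their offsets, so they never block each other."""
        new = self.events[offset:]
        return new, offset + len(new)

# A hypothetical "orders" domain publishes its data product as an event stream.
orders = EventLog()
orders.produce({"order_id": 1, "amount": 42.0})
orders.produce({"order_id": 2, "amount": 13.5})

# Two consuming domains (e.g., analytics and fraud detection) read the same
# stream independently, each with its own offset -- true decoupling.
analytics_events, analytics_offset = orders.poll(0)
fraud_events, fraud_offset = orders.poll(0)
```

In real Kafka, the broker persists the log durably and consumer groups manage their offsets, but the decoupling principle between data products is the same.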

This is where data streaming fits into the data mesh story:

Flexibility through decentralization and best-of-breed with data streaming

The de facto standard for streaming data is Apache Kafka. A cloud-native data streaming infrastructure that can interconnect clusters enables the creation of a modern data mesh. No data mesh will use just one technology or vendor. Learn from inspiring posts by your favorite data product vendors such as AWS, Snowflake, Databricks, Confluent, and others on how to successfully define and build your custom data mesh. A data mesh is a journey, not a big bang. A data warehouse or data lake (or, these days, a lakehouse) cannot be the only infrastructure for data meshes and data products.
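As one hedged illustration of interconnecting clusters with open-source Apache Kafka: MirrorMaker 2 replicates topics between clusters based on a properties file along the lines of the sketch below (cluster aliases and bootstrap addresses are placeholders). Confluent's Cluster Linking, mentioned in the diagram below, achieves similar cross-cluster replication natively in the brokers, without a separate Connect cluster.

```properties
# connect-mirror-maker.properties (sketch; addresses are placeholders)
clusters = onprem, cloud

onprem.bootstrap.servers = kafka.dc1.internal:9092
cloud.bootstrap.servers = kafka.cloud.example.com:9092

# Replicate all topics from the on-premises cluster to the cloud cluster
onprem->cloud.enabled = true
onprem->cloud.topics = .*
```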

Example: A real-time data fabric in a hybrid cloud

Here’s an example spanning a streaming data mesh across multiple cloud providers such as AWS, Azure, GCP, or Alibaba, and on-premises/edge sites:

Hybrid Cloud Streaming Data Mesh Powered by Apache Kafka and Cluster Linking

This example illustrates all the features discussed in the above sections for a data mesh:

  • A decentralized real-time infrastructure across domains and infrastructures
  • True decoupling between domains and between clouds
  • Multiple communication paradigms, including data streaming, RPC, and batch
  • Data integration with legacy and cloud-native technologies
  • Continuous stream processing where it adds value, and batch processing in some analytics sinks
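To make the contrast between continuous stream processing and batch processing concrete, here is a minimal, hypothetical Python sketch: a per-key running aggregate is emitted for every incoming event, instead of being computed once over a completed batch. In practice, this role would be filled by Kafka Streams, ksqlDB, or Apache Flink; the event keys and amounts below are invented for illustration.

```python
from collections import defaultdict
from typing import Iterable, Iterator

def running_totals(events: Iterable[dict]) -> Iterator[dict]:
    """Emit an updated per-key running total for each incoming event,
    processing one event at a time rather than waiting for a full batch."""
    totals: dict = defaultdict(float)
    for event in events:
        totals[event["key"]] += event["amount"]
        yield {"key": event["key"], "total": totals[event["key"]]}

# A small, hypothetical event stream (in reality, an unbounded Kafka topic).
stream = [
    {"key": "eu", "amount": 10.0},
    {"key": "us", "amount": 5.0},
    {"key": "eu", "amount": 2.5},
]
results = list(running_totals(stream))
# Each element reflects the state *at that point in the stream*;
# a batch job would only ever see the final totals.
```

The design choice is that state (`totals`) lives inside the processor and is updated incrementally, which is what allows results to be available continuously while data is in motion.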

Data Mesh provides flexibility and freedom of technology choice for each data product

The heart of a decentralized data mesh infrastructure should be real-time, reliable and scalable. As the de facto standard for data streaming, Apache Kafka plays an important role in cloud-native data mesh architecture. Nevertheless, the data mesh is not bound to a specific technology. The beauty of a decentralized architecture is the freedom of technology choice for each business unit when building its data products.

Sharing data within domains and between organizations is another aspect where data streaming helps in a data mesh. Real-time data beats slow data. This is true not only for most business problems across industries but also for replicating data between data centers, clouds, regions, or organizations. A streaming data exchange enables sharing data in real time to create a data mesh in motion.

Have you started building your data mesh? What does your enterprise architecture look like? What frameworks, products, and cloud services do you use? Is the heart of your data mesh real-time data in motion, or a lakehouse at rest?
