I am increasingly getting asked about the difference between the Data Fabric and the Data Mesh. They are both emerging paradigms designed to solve a prevalent problem in modern data management for large enterprises, and if you only have a surface understanding of these two ideas, they sound very similar. Nonetheless, they are fundamentally different techniques that make vastly different technical assumptions, and it seems appropriate to explain the differences in a public forum.
Before we get to the differences, let’s start with the key similarity: both techniques attempt to solve the exact same problem:
The Problem Statement
People are beginning to fully appreciate the value of data to running a successful enterprise. Human intuition is fundamentally fallible, memory is temporary, and basing decision-making on real data is critical to success in the modern world. Once the value of data is appreciated, enterprises start accruing (and hoarding) it from all kinds of different sources. This has led to a massive scale problem along two dimensions. First, the datasets themselves are large and getting larger, as data sources accelerate the rate at which they take measurements. Second, the number of unique sources grows as it becomes easier to track and collect data through an ever-increasing variety of methods.
The first scalability problem is typically the easier one to handle with money and technology. Scaling data storage is usually straightforward: the data can simply be partitioned (divided) across more machines, and analyzing large datasets can often be done in parallel across all the machines storing partitions of the data.
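The partition-and-parallelize pattern just described can be sketched in a few lines. This is a toy, single-process illustration (dictionaries standing in for machines, a count-by-region query standing in for a real analysis), not any particular system’s API:

```python
from collections import defaultdict

def partition(records, num_nodes, key):
    """Hash-partition records across num_nodes storage nodes by key."""
    nodes = defaultdict(list)
    for rec in records:
        nodes[hash(rec[key]) % num_nodes].append(rec)
    return nodes

def local_count(partition_records):
    """Per-node partial aggregate: count records by region."""
    counts = defaultdict(int)
    for rec in partition_records:
        counts[rec["region"]] += 1
    return counts

def merge(partials):
    """Combine per-node partial counts into the global answer."""
    total = defaultdict(int)
    for p in partials:
        for region, n in p.items():
            total[region] += n
    return dict(total)

records = [{"id": i, "region": "EU" if i % 2 else "US"} for i in range(10)]
nodes = partition(records, num_nodes=3, key="id")
result = merge(local_count(p) for p in nodes.values())
print(result)  # counts by region: {'US': 5, 'EU': 5}
```

In a real system, each `local_count` would run on the machine that stores that partition, so the expensive scan parallelizes while only the small partial aggregates move over the network.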
Unfortunately, the second scalability problem is much more complex. Different data sources typically mean different data semantics, and the differing semantics makes unifying and integrating these datasets a challenge. Different datasets may use different terms to refer to the same real-world entities, may associate different attributes (and types of those attributes) with the same real-world entities, and may differ in the context in which data was sampled. Data scientists and analysts want to query all datasets to get a holistic view of a particular inquiry, yet there is so much semantic difference across the datasets that querying them all together yields more chaos and uncertainty than understanding and insight. On the other hand, not querying all the datasets forfeits the potential value they can bring.
At the end of the day, data integration and semantic unification must occur, and today that is typically performed by a centralized human team that invariably becomes an organizational bottleneck, delaying the availability of data to data analysts, increasing the bureaucratic annoyances of incorporating new datasets into existing data pipelines, and generally resulting in reduced organizational agility and quality of data analysis.
This bottleneck of relying on a slow, central human team to make new datasets available is a clear problem that desperately needs a solution. The Data Fabric and Data Mesh are two different approaches to solving this same problem.
The Data Fabric Solution
At its core, the Data Fabric is about eliminating humans from the process as much as possible. Datasets originate as silos but are brought into the Fabric via explicitly maintaining connections from a dataset to other datasets in the Fabric. These connections are maintained as metadata about a dataset and are enriched with business semantics through an automated semantics derivation process implemented with artificial intelligence. These semantics enable the formation of a knowledge graph that deepens the connection across datasets and allows data analysts to discover relevant data to a particular analytical process. Analysis jobs then follow the connections across datasets to include a broader swath of data in the analysis.
The Data Fabric does not require replacement of existing infrastructure but instead adds a layer of additional technology above the existing infrastructure that performs metadata management and inference. Data management tools that store and process data across the enterprise may remain in their place, and data from different applications can remain distributed across these tools. Metadata is extracted from these tools and is used to form logical connections across different datasets. This allows datasets to potentially remain under their existing authority without necessarily coming under a centralized governance framework. Nonetheless, the data and metadata in the Data Fabric must be trusted in order for the conclusions that the machine learning techniques form over the metadata to be accurate. Over time the expectation is that widely used datasets will come under central governance. Furthermore, enterprise-wide metadata management is the foundation of the Data Fabric and works better when the sources of the metadata are integrated. Therefore, Gartner recommends that in a Data Fabric: “You must slowly replace separate data management tools and platforms that are isolated in their use of metadata with tools and platforms that share internal metadata in a much broader sense.” (Gartner, Top Strategic Technology Trends for 2022: Data Fabric, Oct 18, 2021).
The Data Fabric thus reduces human burden in (at least) the following ways:
- Time spent in data discovery decreases as machine learning algorithms help make recommendations about datasets relevant to an analysis task.
- Datasets can be automatically brought together for particular applications, reducing the burden on the human teams that deploy and redeploy infrastructure for those applications.
- In some cases, data integration of disparate datasets can be automated, reducing the burden on the existing human teams that perform this same integration. In theory, the work of deriving the semantics of a dataset and connecting it to the Fabric is largely automated, with humans only getting involved in exceptional circumstances.
- Other types of data management orchestration across the enterprise can be automated.
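The metadata-driven discovery described above can be illustrated with a deliberately crude sketch. A real Data Fabric derives semantics with AI over much richer metadata; here, exact column-name overlap stands in for that derivation, and all dataset and column names are hypothetical:

```python
from itertools import combinations

# Toy metadata catalog: dataset name -> set of attribute names (all hypothetical).
catalog = {
    "crm_customers":   {"customer_id", "name", "email"},
    "web_orders":      {"order_id", "customer_id", "total"},
    "support_tickets": {"ticket_id", "customer_id", "status"},
    "warehouse_stock": {"sku", "quantity"},
}

def infer_edges(catalog):
    """Crude stand-in for automated semantics derivation: connect two
    datasets in the knowledge graph if their metadata shares an attribute."""
    edges = {name: set() for name in catalog}
    for a, b in combinations(catalog, 2):
        if catalog[a] & catalog[b]:
            edges[a].add(b)
            edges[b].add(a)
    return edges

def recommend(edges, dataset):
    """Data discovery: suggest datasets connected to the one being analyzed."""
    return sorted(edges[dataset])

edges = infer_edges(catalog)
print(recommend(edges, "web_orders"))  # ['crm_customers', 'support_tickets']
```

The point of the sketch is the shape of the architecture: connections live in metadata, they are derived rather than hand-maintained, and an analysis job can follow them to pull in datasets the analyst did not know about.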
The Data Mesh Solution
While the Data Fabric tolerates some amount of distributed data governance and management, it ultimately moves an enterprise in the direction of centralized control over data in the Fabric in order to maximize the reliability of the metadata and output of the machine learning algorithms. In contrast, the Data Mesh more completely embraces distributed data governance and management. Different teams of domain experts maintain control over their own datasets and make them available directly to other teams via “data products” without any requirement to go through a centralized process. There is no built-in mechanism to discover new datasets relevant to a particular analytical task, but a global catalog can be used to facilitate data discovery.
Data governance is performed in a distributed fashion. Although an enterprise may issue standards that apply to data management across the organization, there is no built-in enforcement of those standards, and a domain expert team can make a dataset immediately available via a data product without going through any kind of central approval process. This greatly improves the agility of an organization but comes with obvious risks when standards are ignored.
Integrating and connecting data to existing datasets is done at the discretion of the data product owners. They can choose to go to the effort of performing this data integration, or they can choose to make a dataset available “as is” (as a data silo) and let the data product consumers choose how they want to connect this dataset with other relevant data.
Each data product owner is free to choose a particular set of infrastructure and tools that are used in the collection, generation, storage, and sharing of the data product. Typically, an enterprise will offer a central pool of infrastructure that the individual teams of domain experts can draw from. However, the particular combination of items from this pool that are used is at the discretion of each individual team.
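The self-serve publishing model described above can be sketched as follows. The Data Mesh does not prescribe any particular schema or registry, so every field and class name here is a hypothetical illustration of the idea that a domain team publishes a data product directly, with the global catalog serving discovery only:

```python
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    """Minimal descriptor a domain team publishes; fields are illustrative."""
    name: str
    owner_team: str
    endpoint: str              # where consumers read the data
    schema: dict               # attribute name -> type
    integrated: bool = False   # False: published "as is", as a silo
    tags: list = field(default_factory=list)

class GlobalCatalog:
    """Discovery-only registry: registration requires no central approval."""
    def __init__(self):
        self.products = {}

    def register(self, product):
        # No gatekeeping: the owning domain team publishes directly.
        self.products[product.name] = product

    def search(self, tag):
        return [p.name for p in self.products.values() if tag in p.tags]

catalog = GlobalCatalog()
catalog.register(DataProduct(
    name="churn_scores", owner_team="analytics",
    endpoint="s3://analytics/churn/",
    schema={"customer_id": "str", "score": "float"},
    tags=["customers", "ml"]))
print(catalog.search("customers"))  # ['churn_scores']
```

Note what is absent: there is no approval step in `register`, which is precisely the property that keeps the central team off the critical path.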
The Data Mesh still includes centralized teams that perform negotiations with vendors to create this global pool of infrastructure resources, make this pool available across the enterprise, and create the aforementioned governance standards. But these centralized teams are never the bottleneck, since they only serve to enhance the efforts of the distributed teams, and do not block their progress. Thus, the Data Mesh eliminates the main source of scalability and agility limitations in modern enterprise data management.
What’s the Main Difference?
There is clearly a lot of overlap between the Data Fabric and the Data Mesh. They both allow data to be managed in a distributed fashion across an enterprise and avoid central human team bottlenecks. However, there is a clear difference:
The Data Fabric still requires a central human team that performs critical functions for the overall orchestration of the Fabric. Nonetheless, in theory, this team is unlikely to become an organizational bottleneck because much of their work is automated by the artificial intelligence processes in the Fabric. In contrast, in the Data Mesh, the human team is never on the critical path for any task performed by data consumers or producers. However, there is much less emphasis on replacing humans with machines; instead, the human effort shifts to the distributed teams of domain experts who are the most competent in performing it.
In other words, the Data Fabric fundamentally is about eliminating human effort, while the Data Mesh is about smarter and more efficient use of human effort.
Data Fabric vs. Data Mesh: What Should You Use?
Some of the Data Fabric ideas are not mutually exclusive with the Data Mesh. For example, the Data Mesh still needs a global catalog of data to help with data discovery, and this can be implemented using some of the metadata management practices of the Data Fabric. Furthermore, a centralized Data Fabric can coexist with a Data Mesh by becoming a giant data product within a broader Data Mesh.
Nonetheless, the top-down control of the Data Fabric is fundamentally antithetical to the bottom-up process of the Data Mesh, so an enterprise cannot truly embrace both approaches. Ultimately, each enterprise will need to pick a side and decide whether they have a “Data Mesh” bottom-up mentality or “Data Fabric” top-down mentality to enterprise data management.
So which approach is best? Both sides accuse the other of an unrealistic pie-in-the-sky view of reality. Data Mesh advocates view the use of artificial intelligence in the Data Fabric to automatically generate the semantics of data and perform data integration as a laughable overestimation of the power of AI. Context and implicit knowledge are critical in understanding a dataset, and they believe that data integration is best done by human domain experts. They also point to the failure of the Linked Data vision of Tim Berners-Lee (the inventor of the World Wide Web) to make a significant real-world impact in the more than two decades that it has been around. [The Linked Data vision is similar to the Data Fabric in the way that ontologies and connections between data are used to derive semantics.] In response, the Data Fabric advocates point to the existing case studies at several large enterprises and point out that AI has come a long way in the past two decades.
The Data Fabric advocates view the fully distributed data management practice of the Data Mesh as a recipe for chaos, silos, and lack of adherence to standards and global identifiers. They point out that distributed data governance is unlikely to succeed without central enforcement. In response, Data Mesh advocates say that the Data Mesh will actually result in fewer silos, because it becomes easier to make datasets available to other teams. As long as they are incentivized properly, data product owners will put in the effort of integrating their products with other datasets within the enterprise, and will do a better job of this integration since they are the domain experts for these datasets. As for data governance, they argue that strong leadership, training, and best practices within the enterprise can overcome the inherent challenges of distributed governance.
So which side will win? We’ll find out in the coming years. However, I suspect that we’ll see more Data Mesh deployments in the near term since the Data Mesh can be implemented today with existing technology, whereas the Data Fabric relies on further developments in machine learning-driven data integration before it can truly take off. The success of the early Data Mesh movers will likely determine the course and ultimate winner of this debate.