You might have heard about Data Mesh recently, which is a modern approach to managing data and analytics in a distributed, domain-driven fashion. At its core, Data Mesh applies domain-oriented decomposition and ownership to an organization’s data. Being a software engineer, I see several parallels between Data Mesh and microservices, which applies domain-oriented decomposition techniques to operational systems. Hence we can use our experience with designing microservices-oriented systems to help inform how to implement a Data Mesh architecture.
In a Data Mesh, the fundamental unit of value is a data product. Just as we might have one or more microservices that implement the business logic of a domain, we could have one or more data products that represent the analytical data for the domain. An analyst or data scientist would consume these data products to make data-driven business decisions. For example, a weather company might operate a service that provides real-time temperature information for the specified zip code, and it might publish a couple of data sets - one which has the daily minimum, maximum and average temperatures by zip code, and another with the hourly minimum, maximum and average temperature by zip code.
When we think of a functional system for deploying and managing microservices-based applications, there are several features that we’ve come to expect from the platform:
- Feature-rich operating system
- Easy composition
- Easy discovery
- Authentication & authorization framework
- Easy deployment
- Usage and performance metrics
I will talk about each of these features as they apply to microservices and how they translate to Data Mesh, and why Starburst Enterprise is the ideal solution for implementing a Data Mesh architecture.
Feature-rich operating system
To be useful, we need a feature-rich operating system for microservices to run on - these days, Kubernetes has become the de facto standard for such an OS, making it easy to manage and scale applications built on services.
In Data Mesh, the “operating system” is what’s known as the self-service platform. This is where Starburst Enterprise Platform comes in, by providing a powerful, scalable analytical SQL query engine - Trino - that makes it very easy to consume a variety of data products. As with Kubernetes, Starburst is built for cloud environments and can scale to the largest workloads in the world while providing best-in-class performance. It offers JDBC and ODBC connectivity for seamless operation with all the popular BI and ML tools.
To facilitate application development using microservices, they need to be easily composable - typically we implement this by using standard protocols like REST over HTTP, as well as agreeing on API calling conventions and standards. Imagine trying to build an application that consumes services built with REST, gRPC and SOAP!
To make a service easily consumable by a variety of applications, we typically use a widely-adopted standard data format like JSON, which allows for flexibility in the consuming application. For example, the consuming application could be built on NodeJS, or in Java or .NET, all of which have well-supported libraries for working with JSON data.
For the same reason, in a Data Mesh, we need to make data products easily composable - if an analyst needs to jump through hoops to incorporate a data set, it makes their job so much harder. With Trino, using connectors, all data products can be exposed as SQL views regardless of the format of the underlying data sources. By standardizing on SQL, Trino lets consumers easily combine multiple data products to derive insights quickly. With the weather example mentioned above, for example, an analyst could perform a SQL join between the daily temperature data and daily store transaction data to find patterns and correlations between store foot traffic and the average temperature.
Trino also makes it very easy to use data products in other analysis engines, such as Apache Spark for machine learning. For example, we can use Trino to directly save data into an object store (such as HDFS, Azure Data Lake or Amazon S3) in the commonly supported Spark file formats (such as Avro, Parquet, CSV or JSON). Hence, if we create a data product using such tables, we can then use Spark to directly access the data from the underlying object store without having to export it from Trino. Likewise, if a dataset is the result of a machine learning algorithm and saved as, say a Parquet file in object storage, we can use Trino’s support for object storage to directly query that dataset as a SQL table.
Easy discovery and accessibility
In order to call a microservice, it needs to be easily discoverable and accessible. A platform like Kubernetes provides APIs to list available services and an in-cluster DNS to easily address any service by name (by establishing a convention that services are accessible using <service-name>.<namespace>.svc.cluster.local). This is incredibly powerful as it frees developers from having to worry about routing and network access to their service as the platform just takes care of it. In addition, typically we’d use something like OpenAPI for documenting all the APIs, which helps the consumer easily call the service.
Similarly, in a Data Mesh, for a data product to be useful, it needs to be easily discoverable and accessible. With a platform like Trino, we can use SQL information schema to easily find the available data products and their information. For example, if we assume that the weather data products are published to a schema named weather in the catalog named data_mesh, we can find all the available tables using:
show tables from data_mesh.weather;
If we want to know the available columns in the daily_temperatures table, we can use:
show columns from data_mesh.weather.daily_temperatures;
In addition, we can use a data catalog product such as Amundsen or Apache Atlas, both of which integrate with Starburst Enterprise, to provide search capabilities and additional metadata for data products. Thus, exposing the data product to a Trino cluster frees the publisher from having to worry about making it accessible and easily discoverable by consumers.
Authentication and authorization framework
For any application built with microservices, authentication and authorization are a critical piece of the platform - not all services should be accessible to all users or other services. We could implement such a framework using an Open ID Connect provider, or using primitives provided by Kubernetes itself (such as identities, roles and service accounts). We can further control how services are exposed to the outside world using an ingress, which lets us limit what endpoints are accessible from outside the cluster.
In a Data Mesh, it’s important to control who has access to what kind of data - for example, the raw data set from weather stations shouldn’t be accessible to anyone except the team responsible for generating and maintaining that data. The daily and hourly temperatures would be an aggregation of the raw weather station data and represents the “public API” for data products published by the team, and most users would have read-only access to these tables. Using a system like Starburst Enterprise lets us define fine-grained role based access to tables, views, rows and columns. Creating a view with SQL security DEFINER lets users have access to views even though they don’t have access to the underlying tables that the view is created from. With such a powerful and flexible system, publishers can control exactly who has access to what data products.
One of the biggest selling points of Kubernetes is the ability to deploy services and applications in a container (like Docker). This enables developers to build our applications with complete flexibility in the tech stack - we can mix and match services built in Java, NodeJS, .NET and as long as they’re dockerized, they will work seamlessly with one another. Deployment is a breeze as well - all we need is a declarative description of how we’d like a service deployed and Kubernetes handles all the hard work of making it happen.
With Trino, the pluggable connector mechanism allows easily exposing any data product with a SQL interface, regardless of the underlying data. For example, if your data product is a parquet file that represents the output of a machine learning pipeline in Spark, you can use the Hive connector to make it available as a SQL table within Trino. The myriad built-in connectors, as well as the ability to add custom connectors, allows complete flexibility in publishing your data products. With the weather example above, the source weather station data could be flat files in a data lake, while the aggregated daily_temperatures and hourly_temperatures tables might be relational tables in a MySQL database.
Usage and performance metrics
An important element of the domain-oriented ownership of services is ensuring the continuous operation and lifecycle of the microservices. This includes deploying new versions, rolling back, and monitoring the usage and availability of the services. Without usage and performance metrics, performing these operations would amount to “push the button and pray real hard” which isn’t very reassuring. That’s why every platform dealing with services provides some infrastructure around collecting and visualizing a variety of service metrics, such as the number of successful and failed requests, errors, request time, and request latency. Monitoring these metrics over time and through upgrades helps build confidence around releasing new versions, bug fixes and providing service level agreements (SLAs).
In Data Mesh architectures, the ownership of the quality and lifecycle of data products rests with the domain owners, which means they’re responsible for ensuring that the data products are available, new versions of the data products do not break downstream users, and querying the data products are reasonably performant. Using the event logger plugin provided by Trino goes a long way in providing the necessary usage metrics of the tables and views in a data product. Event logger captures extensive query information including the tables and views it accesses, query time, and other performance metrics. Publishers can use this information to answer questions like “how often is this particular product queried?”, “what’s the average time for a query involving this product?”. Building visualization dashboards around these metrics can help publishers spot problems - for example, if the usage drops drastically after a new version of a table is published, it might indicate a breaking change in the table schema.
Data mesh architecture is a new and evolving paradigm; as we learn more about it through trying out different approaches there will no doubt be additional exciting development. We can learn a lot from parallels with the development of microservice architectures, as well. Stay tuned for updates as our understanding of Data Mesh evolves, and contact us with any questions!