Cookie Notice

This site uses cookies for performance, analytics, personalization and advertising purposes.

For more information about how we use cookies please see our Cookie Policy.

Manage Consent Preferences

Essential/Strictly Necessary Cookies

Required

These cookies are essential in order to enable you to move around the website and use its features, such as accessing secure areas of the website.

Analytical/ Performance Cookies

These are analytics cookies that allow us to collect information about how visitors use a website, for instance which pages visitors go to most often, and if they get error messages from web pages. This helps us to improve the way the website works and allows us to test different ideas on the site.

Functional/ Preference Cookies

These cookies allow our website to properly function and in particular will allow you to use its more personal features.

Targeting/ Advertising Cookies

These cookies are used by third parties to build a profile of your interests and show you relevant adverts on other sites. You should check the relevant third party website for more information and how to opt out, as described below.

Blog

Resources

Documentation

Kamil Bajda-Pawlikowski

Co-Founder and CTO

Starburst

Why Data Lakehouse Architecture Now?

Last Updated: December 19, 2023

Data Lakehouse Education

Recently I presented with my colleagues, Justin Borgman, CEO of Starburst, and Daniel Abadi, Chief Scientist, at The Data and AI Summit by Databricks and discussed the need for the data lakehouse architecture. The data lakehouse option is a very promising paradigm for data management because it combines the benefits of a data warehouse with the benefits of data lake. With Starburst on top, organizations are better able to analyze their data stored in data lakehouses.

The Need for a Better Approach to Data Management

The data management approach of the past, the data warehouse, requires too much copying and moving of data. With the data warehouse model, there is a lot of preparation and ETL (Extract, Transfer, and Load) necessary for data analysis and; therefore, the time-to-insight is often very slow. So, when businesses turn to their data analysts with a question, they aren’t able to get the answer as quickly and efficiently as they would like.

Traditionally data warehouses have been used to consolidate data; however, moving all this data into one central EDW (Enterprise Data Warehouse) is in many cases not practical: the time required is too long, it is a closed system which causes vendor lock-in, and it is typically very costly. In addition, data in these warehouses is harder to share within an organization.

Benefits of a Data Lakehouse

Data lakehouses have become a solution for some of the problems that a traditional data warehouse has. With a data lakehouse, business intelligence, reporting, data science, and machine learning experts are all able to work together on the same data, in all its different forms, within the data lakehouse. Furthermore, one of the main benefits of a data lakehouse is optionality: Justin highlighted this important change in data architecture which, “[allows] you as the customer, to ultimately have control and ownership over your data.” This optionality is three fold. Firstly, the lakehouse extends a data lake strategy. Customers can leave data in the lowest cost storage, no data movement is required, and bring data governance to the data lake. Secondly, you can use open data formats whenever possible. By using open file formats, you can work with multiple tools to access the same data and avoid vendor lock in. Lastly, a lakehouse enables a data consumption layer. This is where Starburst comes into play by being able to access data where it lives and still be able to query all your data together.

Data Sharing: A Critical Progression

Another important aspect of the data lakehouse is the ability to easily share data. Right now, data sharing is a critical process. There has been a flip in the ”Don’t Share Data” mantra. Daniel described this mindset change as, “[Earlier] companies would worry about the risks of sharing data both within an organization and across organizations. Now it’s the opposite, now companies have to worry about the risk of not sharing data.”

Gartner Data Sharing Model

This is where Delta Sharing comes into the picture. Customers have the ability to share live data without having to copy it out. It also supports a wide range of clients using existing data formats, and it is secure and scalable. Delta Sharing works by communicating via a Delta server in order to access different data sets. The customer makes a request to the server for a data set. The server then responds with short-lived URLs that can be accessed directly from the data consumer. This allows for external data to be incorporated into data analysis.

Data Analysis - Under the Hood Diagram

Why Delta Lake?

Delta Lake helps to address many gaps in the data warehouse. It is an open source table format which means that there is no vendor lock-in. It is also stored as Parquet file format which has many the benefits of warehouse technologies, and you can deploy it from anywhere you want. In addition, it has time travel features, exposes metadata and statistics, and allows for data skipping and z-ordering to improve query performance

Why Data Lakehouse?

The data lakehouse combines the best of the data lake and data warehouse. It is, again, open source, but is also high performance, reduces data movement and overall costs, and better supports AI and ML workloads.

Data Lakehouse

Trino and Starburst’s Native Delta Lake Reader

Trino is a an open source high performance MPP SQL engine that has proven to be scalable and has high concurrency. It also allows for a separation of compute and storage, allows for greater optionality, and the ability to deploy it anywhere.

Starburst’s native Delta lake reader supports Delta transaction logs, data skipping and dynamic filtering, and it optimizes queries using file level statistics. The performance is much faster than the speed achievable with just Parquet file format. In a typical flow, data is ingested as it arrives into the system, it is then refined and aggregated, which leads to more efficient data analysis and machine learning later on. You have the ability to connect to a variety of BI and SQL tools and can run all operations from Starburst and Trino directly.

Data Flow Diagram

Data Lakehouse in Practice

During the pandemic, the health care community was hit hard with the need to turn digital and absorb an abundance of data in an efficient and comprehensible way. EMIS Health needed to understand the spread of COVID, while also monitoring other diseases. In addition, once vaccines were made available, there was another need to track the vaccine rollout. This meant that a lot of data from a variety of sources needed to be queried quickly. The query needed to be flexible, secure, and scalable. It was also needed in real time, something always ready for analysis. They found that the data lakehouse combined with Starburst provided a solution to all of their data needs, and is the future of their data analytics.

A single point of access to all your data

Stay in the know - Sign up for our newsletter!

Resources

Quick Links

Get In Touch

© Starburst Data, Inc. Starburst and Starburst Data are registered trademarks of Starburst Data, Inc. All rights reserved. Presto®, the Presto logo, Delta Lake, and the Delta Lake logo are trademarks of LF Projects, LLC

Start Free with
Starburst Galaxy

Up to $500 in usage credits included

Query your data lake fast with Starburst's best-in-class MPP SQL query engine
Get up and running in less than 5 minutes
Easily deploy clusters in AWS, Azure and Google Cloud

For more deployment options:

Download Starburst Enterprise

Essential/Strictly Necessary Cookies

Analytical/ Performance Cookies

Functional/ Preference Cookies

Targeting/ Advertising Cookies

By Use Cases

By Industry

Documentation

Connect

Education

Blog

Resources

Pages

Documentation

Why Data Lakehouse Architecture Now?

Last Updated: December 19, 2023

Related posts

Get Started with Starburst Galaxy today

The Need for a Better Approach to Data Management

Benefits of a Data Lakehouse

Data Sharing: A Critical Progression

Why Delta Lake?

Why Data Lakehouse?

Trino and Starburst’s Native Delta Lake Reader

Data Lakehouse in Practice

A single point of access to all your data

Stay in the know - Sign up for our newsletter!

Resources

Quick Links

Get In Touch

Start Free with
Starburst Galaxy

For more deployment options:

Essential/Strictly Necessary Cookies

Analytical/ Performance Cookies

Functional/ Preference Cookies

Targeting/ Advertising Cookies

By Use Cases

By Industry

Documentation

Connect

Education

Starburst Galaxy

Starburst Enterprise

By Use Cases

By Industry

Documentation

Connect

Education

Filter:

Blog

Resources

Pages

Documentation

Why Data Lakehouse Architecture Now?

Last Updated: December 19, 2023

Related posts

Starburst Enterprise LTS Backport Releases

Introducing New Data Observability Features in Starburst Galaxy – Now in Public Preview

Automating the “Icehouse” – Fully-managed Open Lakehouse Platform on Starburst Galaxy

What’s New in Starburst Galaxy – April 2024

Get Started with Starburst Galaxy today

The Need for a Better Approach to Data Management

Benefits of a Data Lakehouse

Data Sharing: A Critical Progression

Why Delta Lake?

Why Data Lakehouse?

Trino and Starburst’s Native Delta Lake Reader

Data Lakehouse in Practice

A single point of access to all your data

Stay in the know - Sign up for our newsletter!

Resources

Quick Links

Get In Touch

Start Free withStarburst Galaxy

For more deployment options:

Start Free with
Starburst Galaxy