
Brian Luisi

Director, Solution Architects

Starburst

Accelerating Data Science with Trino

Last Updated: April 6, 2023

Tags: Data Lake, Data Scientist, Datanova, Head of Analytics, Query Federation, SQL, Trino

At our Datanova for Data Scientists conference on July 14, I held a discussion with Dain Sundstrom and David Phillips, CTOs of Starburst, about how Trino and Starburst can make the data scientist’s work easier and more efficient. Data access and exploration are central to that work, so data scientists need an engine that lets them query data quickly and explore new predictive models. Trino and Starburst can help them do that.

An Introduction to Trino

No matter what type of repository an organization chooses, today’s approach to data management still tends to be monolithic. As a result, time to insight is slow, the architecture is expensive and complex, and the data is hard to secure. In this “old” world of data management, data scientists often end up running experiments on stale or incomplete data, or performing extraneous data wrangling steps that limit their efficiency.

With Trino, the goal is to move from a monolithic environment to a single point of access that supports a distributed Data Mesh approach. With this approach, organizations can treat data as a product, and through Trino and Starburst, data is accessed directly where it lives. The role of data scientists is to work through large quantities of data and build predictive models from it. They are tasked with exploring new territory, which means answering cutting-edge questions as well as developing new questions themselves. To do that, data needs to be readily available for processing and analysis, which is what Starburst and Trino provide.

What Does Data Access Mean for Data Scientists?

Data scientists spend the majority of their time exploring and processing data rather than using it to make predictions. Starburst is a distributed query engine that can interact with all of an organization’s data, pull in different data sets, and iterate through that data very quickly. This frees up time for data scientists to explore the data and build models.

Ultimately, this means a shorter cycle time for model development. Trino and Starburst allow faster training and testing of models by breaking down data silos through federation and by providing faster access to large amounts of data stored in the data lake. This lets data scientists ask more questions of their data and get more insights for their organizations.
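To make the idea of federation concrete, here is a minimal sketch of a single SQL statement joining tables that live in two separate databases. It is an illustration only: Python's built-in sqlite3 module stands in for Trino so the example runs anywhere, and the database names (`main`, `lake`), table names, and columns are all hypothetical. In Trino the same pattern would reference tables through their catalog and schema names.

```python
import sqlite3

# Stand-in for two separate systems (assumption: sqlite3 replaces Trino here).
# "main" plays the role of an operational database, "lake" a data lake.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS lake")

# Hypothetical tables, one in each database.
conn.execute("CREATE TABLE main.customers (customer_id INTEGER, segment TEXT)")
conn.execute("CREATE TABLE lake.events (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO main.customers VALUES (?, ?)",
    [(1, "retail"), (2, "wealth"), (3, "retail")],
)
conn.executemany(
    "INSERT INTO lake.events VALUES (?, ?)",
    [(1, 10.0), (1, 5.0), (2, 100.0), (3, 20.0)],
)

# One query spans both databases; no data is copied between them first.
federated_sql = """
    SELECT c.segment, SUM(e.amount) AS total_amount
    FROM main.customers AS c
    JOIN lake.events AS e ON e.customer_id = c.customer_id
    GROUP BY c.segment
    ORDER BY c.segment
"""
rows = conn.execute(federated_sql).fetchall()
print(rows)
```

The point of the sketch is the shape of the query: the join crosses a system boundary inside one statement, so there is no separate export and load step before analysis.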

Use Cases of Trino and Data Scientists

Both Dain and David come from Facebook, where they began the journey towards Trino (the open-source project formerly known as PrestoSQL). One of the use cases for Trino was marketing work related to the ads team. Facebook had extremely large data sets, more data than anyone wanted to spend the time to process. With Trino, they could subset their data more effectively, work more efficiently, and bring all of the initial data into their system. Through Trino, data consumers were able to determine what data was available, what it looked like, and where it lived.

The ability to view data and query it quickly was important for Facebook, and it is just as important for other companies, such as those in financial services. For example, Trino supports use cases such as customer segmentation, anti-money laundering, wealth management, and risk mitigation. It also helps organizations better understand their customers.

Best Practices for Data Scientists

“Know your data.” This is probably the most important practice for data scientists. Data profiling and exploration go faster through the Trino SQL layer. When data is initially collected, it is not organized in a way that fits the data scientist’s purpose. Data scientists therefore have to build up datasets for analysis while updating their models. You want to do as much of this pre-processing as possible in SQL, and this is where Trino can provide value: you can push pre-processing into the Trino SQL layer, where the collection happens, before doing analysis in Python, R, or your language of choice.
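As a rough illustration of pushing pre-processing into SQL, the sketch below filters out bad rows and computes per-customer features in a single query, so the Python side receives only a small, analysis-ready result. It assumes Python's built-in sqlite3 module as a local stand-in for Trino, and the `transactions` table and its columns are hypothetical.

```python
import sqlite3

# Assumption: sqlite3 stands in for the SQL engine; the table is made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (customer_id INTEGER, amount REAL, status TEXT)"
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [
        (1, 20.0, "ok"),
        (1, 40.0, "ok"),
        (2, 15.0, "failed"),  # dropped by the status filter
        (2, 30.0, "ok"),
        (3, None, "ok"),      # dropped by the NULL filter
    ],
)

# Filtering, NULL handling, and aggregation all happen in SQL,
# before any rows reach Python.
feature_sql = """
    SELECT customer_id,
           COUNT(*)    AS txn_count,
           AVG(amount) AS avg_amount
    FROM transactions
    WHERE status = 'ok' AND amount IS NOT NULL
    GROUP BY customer_id
    ORDER BY customer_id
"""
features = conn.execute(feature_sql).fetchall()
print(features)
```

From here the `features` rows could be handed to whatever modeling library you prefer; the design choice is simply that the engine does the heavy lifting and the client only sees the reduced dataset.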

Most of the work for data scientists comes on the front end. That work is simplified by doing sampling, transformations, and drill-downs in Trino, and the most efficient way to do this is with SQL. Because Trino is a SQL-based MPP (massively parallel processing) query engine, data scientists can process and prepare data much faster than with a Python script. It can run thousands of queries across a cluster, making data prep more efficient and reducing time to insight.
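Sampling and transformation in SQL can be sketched as follows. This is an assumption-laden illustration rather than Trino-specific code: it uses Python's built-in sqlite3 module as a stand-in, a hypothetical `events` table, and a simple systematic sample (every tenth row by id) in place of the engine's own sampling. Trino itself offers a `TABLESAMPLE BERNOULLI` clause for random sampling, which this sketch does not use.

```python
import sqlite3

# Assumption: sqlite3 stands in for the SQL engine; the table is made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_id INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(i, float(i)) for i in range(1, 1001)],
)

# Systematic 10% sample (every tenth event_id), with a transformation
# (doubling the value) and an aggregate computed in the same query.
sample_sql = """
    SELECT COUNT(*)       AS n,
           AVG(value * 2) AS avg_doubled
    FROM events
    WHERE event_id % 10 = 0
"""
n, avg_doubled = conn.execute(sample_sql).fetchone()
print(n, avg_doubled)
```

Only two numbers come back to the client, even though the query touched the full table, which is the efficiency argument made above in miniature.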

How Does Trino Help with SRE and DevOps Challenges?

Operational data can be treated as normal source data and collected and analyzed by Trino.

Trino doesn’t change the basic DevOps approach to handling code or code promotion. Source code is checked into a repository as usual and promoted into higher environments. Model development is essentially a development activity: once a model is developed, its code is treated as typical source code.

Both Trino and Starburst can help data scientists improve their efficiency and make their access to data much more seamless. Because all of their data can be accessed in one place, data scientists can better make predictions, explore their data, and find new questions to ask.
