At our Datanova for Data Scientists conference on July 14, I held a discussion with Dain Sundstrom and David Phillips, CTOs of Starburst, about how Trino and Starburst can make the role of the data scientist easier and more efficient. Data access and exploration are extremely important parts of the role, so data scientists need an engine that lets them query quickly and explore new predictive models. Trino and Starburst can help them do that.
An Introduction to Trino
No matter what type of repository organizations decide to use, today’s approach to data management still tends to be monolithic. This means that time to insight is slow, and the architecture is expensive, complex, and hard to secure. In this “old” world of data management, data scientists often end up running experiments on stale or incomplete data or performing extraneous data wrangling steps, both of which limit efficiency.
With Trino, however, the goal is to move from a monolithic environment to a single point of access that supports a distributed Data Mesh approach. With this method, organizations can treat data as a product, and through Trino and Starburst, data is accessed directly and seamlessly where it lives. The role of data scientists is to work through large quantities of data and then build predictive models from it. Data scientists are the people tasked with blazing new territory, which means finding answers to cutting-edge questions as well as developing new questions themselves. As a consequence, data needs to be readily available for processing and analysis, which is what Starburst and Trino provide.
What Does Data Access Mean for Data Scientists?
Data scientists spend the majority of their time finding, exploring, and processing data instead of actually using it to make predictions. Starburst is a distributed query engine that can interact with all of an organization’s data, pull in different data sets, and iterate through all of this data very quickly. This frees up time for data scientists to analyze the data and build models.
Ultimately, what this means for data scientists is a shorter cycle time for model development. Trino and Starburst allow faster training and testing of models by breaking down data silos via federation and by providing faster access to large amounts of data stored in the data lake. This allows data scientists to ask more questions of their data and get more insights for their organizations.
Use Cases of Trino and Data Scientists
Both Dain and David come from Facebook, where they began this journey towards Trino (the open-source project formerly known as PrestoSQL). One of the use cases for Trino there was marketing, specifically for the ads team. Facebook had extremely large data sets, more data than anyone wanted to spend the time to process. With Trino, however, teams could improve their ability to subset their data, improve their efficiency, and bring all the initial data into their system. Through Trino, data consumers were able to determine what data was available, what it looked like, and where it lived.
The ability to view data and query it quickly was important for Facebook, but it is just as important for other companies, such as those in financial services. For example, customer segmentation, anti-money laundering, wealth management, and risk mitigation are important use cases that Trino supports. It also helps organizations better understand their customers.
Best Practices for Data Scientists
“Know your data.” This is probably the most important practice for data scientists. Data profiling and exploration go faster through the Trino SQL layer. When data is initially collected, it is not going to be organized in a way that fits the purposes of data scientists. Therefore, data scientists have to build up datasets for analysis while updating their models. You want to do as much of this pre-processing as possible in SQL, and this is where Trino can provide value. You can push pre-processing into the Trino SQL layer, where the collection happens, before doing analysis in Python, R, or your language of choice.
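As a rough illustration of pushing pre-processing into the SQL layer, the sketch below lets the database do the aggregation so that Python only receives the finished feature rows. The `orders` table, its columns, and the helper names are hypothetical and not from the discussion. So that it runs anywhere, the demo uses an in-memory SQLite database; with Trino, the assumption is that you would pass a connection from the `trino` Python client (for example `trino.dbapi.connect(...)`) to the same helper.

```python
import sqlite3

# Hypothetical feature query: the grouping and averaging run in the SQL
# engine, not in Python. Table and column names are made up for this sketch.
FEATURE_SQL = """
SELECT customer_id,
       COUNT(*)    AS n_orders,
       AVG(amount) AS avg_amount
FROM orders
GROUP BY customer_id
ORDER BY customer_id
"""


def fetch_rows(conn, sql):
    """Run sql on a DB-API 2.0 connection and return the rows as dicts."""
    cur = conn.cursor()
    cur.execute(sql)
    columns = [col[0] for col in cur.description]
    return [dict(zip(columns, row)) for row in cur.fetchall()]


def demo_connection():
    """Build a tiny in-memory SQLite stand-in for a real data source."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(1, 10.0), (1, 30.0), (2, 5.0)],
    )
    return conn


# Assumption: against Trino, demo_connection() would be replaced by a
# connection created with the trino client package.
features = fetch_rows(demo_connection(), FEATURE_SQL)
for row in features:
    print(row)
```

The rows returned here are already one-per-customer, so whatever happens next in Python or R starts from a small, analysis-ready dataset rather than from the raw records.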
Most of the work for data scientists comes on the front end. That work is simplified by doing the sampling, transformations, and drill-downs in Trino, and the most efficient way to do this is with SQL. Because Trino is a SQL-based MPP (massively parallel processing) query engine, data scientists can process and prepare data much faster than with a Python script. A Trino cluster can run thousands of queries, which makes data prep work more efficient and reduces time-to-insight.
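To make the sampling step concrete, here is a minimal sketch of a helper that builds a Trino sampling query using the `TABLESAMPLE` clause, so only a fraction of a large table is read for exploration. The function name and the `hive.web.clicks` table are hypothetical; the helper only builds the SQL text and does not connect to anything.

```python
def sample_query(table: str, percent: float, method: str = "BERNOULLI") -> str:
    """Return a SELECT statement that samples roughly `percent` % of `table`.

    `table` is interpolated as-is, so it must come from trusted code,
    never from user input.
    """
    if method not in ("BERNOULLI", "SYSTEM"):
        raise ValueError(f"unsupported sampling method: {method}")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range (0, 100]")
    return f"SELECT * FROM {table} TABLESAMPLE {method} ({percent})"


# Hypothetical table name in catalog.schema.table form.
print(sample_query("hive.web.clicks", 10))
```

The resulting string could then be executed through whichever Trino client you use. Because the sample is random, results will differ between runs, which is usually acceptable for profiling but worth remembering when comparing numbers.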
How Does Trino Help with SRE DevOps Challenges?
Operational data can be treated as normal source data and collected and analyzed by Trino.
Trino doesn’t change the basic DevOps approach to handling code or code promotion. Source code is checked into a repository as it normally is and promoted into higher environments. Model development is essentially a development activity: once a model is developed, its code is treated as typical source code.
Both Trino and Starburst can help data scientists improve their efficiency and make their access to data much more seamless. Because all their data can be accessed in one place, data scientists are better able to make predictions, explore their data, and find new questions to ask.