Cookie Notice

This site uses cookies for performance, analytics, personalization and advertising purposes.

For more information about how we use cookies please see our Cookie Policy.

Manage Consent Preferences

Essential/Strictly Necessary Cookies

Required

These cookies are essential in order to enable you to move around the website and use its features, such as accessing secure areas of the website.

Analytical/ Performance Cookies

These are analytics cookies that allow us to collect information about how visitors use a website, for instance which pages visitors go to most often, and if they get error messages from web pages. This helps us to improve the way the website works and allows us to test different ideas on the site.

Functional/ Preference Cookies

These cookies allow our website to properly function and in particular will allow you to use its more personal features.

Targeting/ Advertising Cookies

These cookies are used by third parties to build a profile of your interests and show you relevant adverts on other sites. You should check the relevant third party website for more information and how to opt out, as described below.

Blog

Resources

Documentation

Brian Olsen

Developer Advocate

Starburst

A Gentle Introduction to the Hive Connector

TL;DR: The Hive connector is what you use in Starburst Enterprise for reading data from object storage that is organized according to the rules laid out by Hive, without using the Hive runtime code.

Last Updated: March 13, 2024

Analytics Engineer Data Architect Data Engineer Data Scientist Developer Head of Analytics How-To Guides Trino

One of the most confusing aspects when starting with the Hive connector comes from the complex Hive model and overlapping use cases of this connector. Typically, you seek out the use of Starburst when you experience an intensely slow query turnaround from your existing Hadoop, Spark, or Hive infrastructure. In fact, the genesis of our core technology, Trino, (formerly known as PrestoSQL), came about due to these slow Hive query conditions at Facebook back in 2012.

So when you learn that Starburst and Trino have a Hive connector, it can be rather confusing since you moved to these distributions to circumvent the slowness of your current Hive cluster. Another common source of confusion is when you want to query your data from your cloud object storage, such as AWS S3, MinIO, and Google Cloud Storage. This too uses the Hive connector. If that confuses you, don’t worry, you are not alone. This blog aims to explain this commonly confusing nomenclature.

NOTE: Since Starburst builds their solution on Trino, we will use Trino instead of Starburst for the rest of the article.

Hive architecture

To understand the origins and inner workings of Trino’s Hive connector, you first need to know a few high level components of the Hive architecture.

Hive Architecture

Hive architecture in four components:

Runtime

The runtime contains the logic of the query engine that translates the SQL -esque Hive Query Language(HQL) into MapReduce jobs that run over files stored in the filesystem.

Storage

The storage component is simply that, it stores files in various formats and index structures to recall these files. The file formats can be anything as simple as JSON and CSV, to more complex files such as columnar formats like ORC and Parquet. Traditionally, Hive runs on top of the Hadoop Distributed File System (HDFS).

As cloud-based options became more prevalent, object storage like Amazon S3, Azure Blob Storage, Google Cloud Storage, and others needed to be leveraged as well and replaced HDFS as the storage component.

Metastore

In order for Hive to process these files, it must have a mapping from SQL tables in the runtime to files and directories in the storage component. To accomplish this, Hive uses the Hive Metastore Service (HMS), often shortened to the metastore to manage the metadata about the files such as table columns, file locations, file formats, etc.

Data organization specification

The last component not included in the image is Hive’s data organization specification. The documentation of this element only exists in the code in Hive and has been reverse engineered to be used by other systems like Trino to remain compatible with other systems.

Trino reuses all of these components except for the runtime. This is the same approach most compute engines take when dealing with data in object stores, specifically, Trino, Spark, Drill, and Impala. When you think of the Hive connector, you should think about a connector that is capable of reading data organized by the unwritten Hive specification.

Trino’s runtime replaces Hive runtime

In the early days of big data systems, many expected query turnaround to take a long time due to the high volume of unstructured data in ETL workloads. The primary goal in early iterations of these systems was simply throughput over large volumes of data while maintaining fault-tolerance. Now, more businesses want to run fast interactive queries over their big data instead of running jobs that take hours and produce possibly undesirable results. Many companies have petabytes of data and metadata in their data warehouse. Data in storage is cumbersome to move and the data in the metastore takes a long time to repopulate in other formats. Since only the runtime that executed Hive queries needs replacement, the Trino engine utilizes the existing metastore metadata and files residing in storage, and the Trino runtime effectively replaces the Hive runtime responsible for analyzing the data.

Trino Architecture

The Hive connector nomenclature

Notice, that the only change in the Trino architecture is the runtime. The HMS still exists along with the storage. This is not by accident. This design exists to address a common problem faced by many companies. It simplifies the migration from using Hive to using Trino. Regardless of the storage component used the runtime makes use of the HMS and that is the reason this connector is the Hive connector.

Where the confusion tends to come from, is when you search for a connector from the context of the storage systems you want to query. You may not even be aware the metastore is a necessity or even exists. Typically, you look for an S3 connector, a GCS connector or a MinIO connector. All you need is the Hive connector and the HMS to manage the metadata of the objects in your storage.

The Hive Metastore Service

The HMS is the only Hive process used in the entire Trino ecosystem when using the Hive connector. The HMS is actually a simple service with a binary API using the Thrift protocol. This service makes updates to the metadata, stored in an RDBMS such as PostgreSQL, MySQL, or MariaDB. There are other compatible replacements of the HMS such as AWS Glue, a drop-in substitution for the HMS.

One last thing; while the Starburst and Open Source Hive connectors have a lot of similarities, Starburst has added several very important features such as Ranger security integration for role-based access control and is tested against multiple Cloudera builds. See the documentation for more details here.

This article was originally posted on the Trino blog.

Hive vs Iceberg: How to migrate your Hive tables to Apache Iceberg

Learn more

A single point of access to all your data

Stay in the know - Sign up for our newsletter!

Resources

Quick Links

Get In Touch

© Starburst Data, Inc. Starburst and Starburst Data are registered trademarks of Starburst Data, Inc. All rights reserved. Presto®, the Presto logo, Delta Lake, and the Delta Lake logo are trademarks of LF Projects, LLC

Start Free with
Starburst Galaxy

Up to $500 in usage credits included

Query your data lake fast with Starburst's best-in-class MPP SQL query engine
Get up and running in less than 5 minutes
Easily deploy clusters in AWS, Azure and Google Cloud

For more deployment options:

Download Starburst Enterprise

Essential/Strictly Necessary Cookies

Analytical/ Performance Cookies

Functional/ Preference Cookies

Targeting/ Advertising Cookies

By Use Cases

By Industry

Documentation

Connect

Education

Blog

Resources

Pages

Documentation

A Gentle Introduction to the Hive Connector

TL;DR: The Hive connector is what you use in Starburst Enterprise for reading data from object storage that is organized according to the rules laid out by Hive, without using the Hive runtime code.

Last Updated: March 13, 2024

Related posts

Hive architecture

Hive architecture in four components:

Runtime

Storage

Metastore

Data organization specification

Trino’s runtime replaces Hive runtime

Trino Architecture

The Hive connector nomenclature

The Hive Metastore Service

Hive vs Iceberg: How to migrate your Hive tables to Apache Iceberg

A single point of access to all your data

Stay in the know - Sign up for our newsletter!

Resources

Quick Links

Get In Touch

Start Free with
Starburst Galaxy

For more deployment options:

Essential/Strictly Necessary Cookies

Analytical/ Performance Cookies

Functional/ Preference Cookies

Targeting/ Advertising Cookies

By Use Cases

By Industry

Documentation

Connect

Education

Starburst Galaxy

Starburst Enterprise

By Use Cases

By Industry

Documentation

Connect

Education

Filter:

Blog

Resources

Pages

Documentation

A Gentle Introduction to the Hive Connector

TL;DR: The Hive connector is what you use in Starburst Enterprise for reading data from object storage that is organized according to the rules laid out by Hive, without using the Hive runtime code.

Last Updated: March 13, 2024

Related posts

Hive vs Iceberg: Choosing the best table format for your analytics workload

Improving performance with Iceberg sorted tables

Automated maintenance for Apache Iceberg tables in Starburst Galaxy

Apache Iceberg Time Travel & Rollbacks in Trino

Hive architecture

Hive architecture in four components:

Runtime

Storage

Metastore

Data organization specification

Trino’s runtime replaces Hive runtime

Trino Architecture

The Hive connector nomenclature

The Hive Metastore Service

Hive vs Iceberg: How to migrate your Hive tables to Apache Iceberg

A single point of access to all your data

Stay in the know - Sign up for our newsletter!

Resources

Quick Links

Get In Touch

Start Free withStarburst Galaxy

For more deployment options:

Start Free with
Starburst Galaxy