
The Right Way to Query Across Data Sources in Tableau (or, The Cross-Database Join Is Not Always Your Friend)

Summary

Use the right tool for the right job. Not doing so means the difference between your Tableau viz rendering in seconds vs. minutes or even hours. Cross-database joins against non-trivial data sets will push you towards minutes and hours. Use an engine that is specifically built for federated querying. 

Introduction

Way back in 2016, Tableau introduced cross-database joins. It was good. It made me happy. That said, the feature wasn’t really built for data at scale. Copying in a bunch of data to Tableau takes too long, and can actually get expensive from a dollars-and-cents perspective if you’re paying data egress charges to your cloud vendor. 

Subsequent improvements to cross-database joins allow you to push data from one source to another and do your join work remotely on that “other”, second data source. This is groovy, except you’re still moving data. That’s a waste of time and could again cost you dinero you’d rather spend elsewhere.

Today I’ll show you the best pattern for query federation in Tableau (and other tools). You’ll avoid:

  • Cross-database joins which unnecessarily copy / move data
  • Mind-bending Data Blending 
  • Scheduled ETL / batch jobs that transform the data before Tableau consumes it (yet more data copying, just done in a different way)

Here, let me prove it.

I don’t expect you to believe everything you read on the Internet, so I’ve created an example for you using TPC-H sample data (scale factor 100) and a simple TPC-H query (it’s Query #3, if you’re interested). 

The data model is simple: 

  • Customers: 15M rows sitting in a database on AWS Aurora (accessed as PostgreSQL)
  • Orders: 150M rows sitting in an S3 data lake
  • Order Line Items: A smidgen over 600M rows in the same data lake

More volatile (and smaller) data is stored on an RDBMS where it can be accessed and updated easily, while the lake stores more static “big data” in a less expensive way. 

The Question: 

“Show revenue over time for Customers in the ‘BUILDING’ segment for any orders placed before 3/15/1995 and shipped after 3/15/1995”

Here is the original TPC-H SQL which represents the question:

Figure 1 - TPC-H Query #3
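In case Figure 1 is hard to read, here is TPC-H Query #3 as published in the spec, with the parameter values used in this example (segment BUILDING, date 1995-03-15):

```sql
SELECT
    l_orderkey,
    SUM(l_extendedprice * (1 - l_discount)) AS revenue,
    o_orderdate,
    o_shippriority
FROM
    customer,
    orders,
    lineitem
WHERE
    c_mktsegment = 'BUILDING'
    AND c_custkey = o_custkey
    AND l_orderkey = o_orderkey
    AND o_orderdate < DATE '1995-03-15'
    AND l_shipdate > DATE '1995-03-15'
GROUP BY
    l_orderkey,
    o_orderdate,
    o_shippriority
ORDER BY
    revenue DESC,
    o_orderdate
LIMIT 10;
```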

Ultimately, we want to see something like this. That means the ORDER BY clause in the TPC-H query doesn’t really do anything for us, since Tableau will be sorting by date anyway: 

Figure 2 - What we're building

OrderKey, which the TPC-H query also returns, isn’t really necessary either; we only need it for our JOINing. We want ALL the data, so LIMIT has to go, too.
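With those trims applied, the shape of the query we actually need boils down to something like this (a sketch, not the SQL Tableau generates):

```sql
SELECT
    SUM(l_extendedprice * (1 - l_discount)) AS revenue,
    o_orderdate
FROM
    customer,
    orders,
    lineitem
WHERE
    c_mktsegment = 'BUILDING'
    AND c_custkey = o_custkey
    AND l_orderkey = o_orderkey
    AND o_orderdate < DATE '1995-03-15'
    AND l_shipdate > DATE '1995-03-15'
GROUP BY
    o_orderdate;
```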

Taking a Shot with Cross-Database Joins

To answer this question with CDBJs (a new acronym! Yay!), we use relationships like so: 

Figure 3 - Data Source Design

Pretty easy stuff.  The first connection in the data source hits Aurora/PostgreSQL directly using the PostgreSQL driver. The second connection uses Starburst to get tables on the lake. 

I added a Data Source filter on Market Segment to return only BUILDING-related customers from Aurora/PostgreSQL. We build out our Revenue measure in a simple way, too:

So let’s fire this viz up and see what happens!

I expected relatively poor performance, but I was quite surprised by just how long it took to complete. Let’s let the Tableau Performance Recorder tell the story: 

Figure 4 - Tableau Performance Recorder Output

What you’re seeing in the animated viz above is Tableau executing seven queries (some in parallel) for a total of 297 minutes of query execution. The viz itself completed in 203 minutes, or just over three hours. What WHAT WHAT?

I ran it again. 

Figure 5 - Tableau Performance Recorder, redux

This time, I saw a mere 150 minutes of total query execution with an overall time-to-view of ~119 minutes, almost an hour and a half faster than the first run. 

What the hell is happening? Well, Tableau is downloading data. We knew that was going to happen, but at scale, ouch. 

If you eyeball the queries in Figure 5, you can see Tableau grabbing the PK and FK values which drive the inter-table relationships, as well as domain members. In the fifth query, it downloads all ~600M rows’ worth of the fields it needs from the Line Item table. 

I’d guess that the difference in view execution time probably had to do with my internet connection. I have ~300 Mbps down, but perhaps I was downloading something else during the first run? Not sure. 

So there’s still good news here: I got my answer from a relatively large federated dataset. It did take longer than the average analyst would care to wait, however. 

And Now, for the Answer You’ve Been Waiting For

Starburst is really, really good at executing fast, smart federated queries. There is none better at interactive federated queries that utilize a data lake. Perfect for Tableau.

Realistically, to get the same answer Tableau arrived at, all we’d need to do is use the query in Figure 1 “as-is”, substituting the names of our Aurora/PostgreSQL and Data Lake schemas for "${database}"."${schema}". Starburst would take care of federation for us, and we’d be done. 
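For example, with hypothetical catalog and schema names, the FROM clause from Figure 1 would simply become fully qualified:

```sql
FROM
    aurora.public.customer,   -- Aurora/PostgreSQL catalog (name is illustrative)
    datalake.tpch.orders,     -- data lake catalog (name is illustrative)
    datalake.tpch.lineitem
```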

Instead, I created an arguably UGLIER query that returns ALL the columns in ALL joined tables, just so a Tableau author could have access to “everything”, whether they need it or not. This approach gives Tableau users the “lots of fields” experience they’re used to. 

I created a View out of this query, and it looked like this: 

Figure 6 - View resident in Starburst
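A sketch of that view, using illustrative catalog and schema names (the real DDL is in Figure 6):

```sql
CREATE VIEW datalake.tpch.q3_federated AS
SELECT
    c.*,   -- every customer column
    o.*,   -- every order column
    l.*    -- every line item column
FROM aurora.public.customer AS c
JOIN datalake.tpch.orders AS o
    ON o.o_custkey = c.c_custkey
JOIN datalake.tpch.lineitem AS l
    ON l.l_orderkey = o.o_orderkey
WHERE c.c_mktsegment = 'BUILDING';
```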

(By the way, I checked and found it made no difference whether the Market Segment = ‘BUILDING’ filter was applied INSIDE the CREATE VIEW statement as I did above, or directly in Tableau as a Data Source or Worksheet-specific filter)

Next, I created a Data Source which calls the view in question. Easy-peasy:

Figure 7 - Tableau Data Source Consuming View

Finally, I recreated the Revenue expression and used order date and ship date for my filtering. Now we get to execute version two of the viz:

Figure 8 - Starburst Handling Query Federation

Yeah, you’re reading that right. Now Tableau issues a single query, gets results back in < 8 seconds, and the viz is rendered in about 11 seconds. Not bad, huh?

Now the skeptic in you is probably thinking one of several things: 

“But I don’t have access to Starburst and standing it up is hard”

“I bet he used some huge honking cluster and he’s totally cheating”

“Evil Sith Sorcery!”

None of these are true, and I’ll address each in a second. First, though, let’s look at what Starburst did under the covers: 

First, it grabs customers (customer keys, to be more precise) who have the correct market segment. This data lives in our database.  Check out Stage 5 in the diagram below:

Figure 9 - Query Planning

Next, in Stage 4, we pull orders which match the date criteria specified in Tableau. Note how Starburst is using dynamic filters to further limit the rows being retrieved from the data lake based on customer keys that originated in the database.
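Conceptually, the dynamic filter turns the Stage 4 scan of the lake into something like this (pseudo-SQL; the customer keys are streamed in at runtime rather than written out literally, and the catalog/schema names are illustrative):

```sql
SELECT o_orderkey, o_custkey, o_orderdate
FROM datalake.tpch.orders
WHERE o_orderdate < DATE '1995-03-15'
  -- keys already fetched from Aurora/PostgreSQL in Stage 5:
  AND o_custkey IN (<dynamic filter: matching customer keys>);
```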

Our customers and orders are then joined… 

Figure 10 - Query Planning

We now have a set of orders we know we care about, so, in Stage 3 we can ask our lake for a limited set of line items based on the order keys we now know actually matter. 

Figure 11 - Query Planning

The last bit of heavy lifting occurs in Stage 2. Because of the dynamic filtering that occurred in Stage 3, only ~324M rows (versus 600M) even need to be considered for the INNER JOIN that occurs between line items and orders. We also do a partial aggregate (SUM) by truncated date.

Figure 12 - Query Planning
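The Stage 2 partial aggregate is conceptually just this, applied to the reduced line item set (the truncation level shown is an assumption; it depends on how the viz buckets dates):

```sql
SELECT
    date_trunc('month', o_orderdate) AS order_period,   -- truncation level is illustrative
    SUM(l_extendedprice * (1 - l_discount)) AS revenue  -- partial SUM; final SUM happens in a later stage
FROM <joined line items and orders>
GROUP BY 1;
```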

I won’t bother with Stages 1 and 0, but ultimately, only five rows are returned to Tableau vs. hundreds-and-hundreds of millions. 

Next, on to Objection Handling!

But I don’t have access to Starburst and standing it up is hard!

Ever heard of Starburst Galaxy? It’s our new 100% SaaS product. We handle everything for you and you can both register and have clusters up and running in a couple of minutes. I kid you not. 

Did I mention we’ll give you $500 in usage credit? I forgot about that, maybe? So sorry. We will. You might just want to consider signing up here. You can stand up clusters sized for whatever question you need to answer. We’ll also let you run three “Free” size clusters on us. You’re welcome!

Figure 13 - Three Happy Clusters

I bet he used some huge honking cluster for this test and he’s totally cheating

Well, that would be pretty dumb since I just told you to try it out yourself! Trust, but verify! ☺

As you can see above, I was using a Medium cluster for my testing. Using a Small cluster, I finished in 11 seconds, so I guess my Medium was overkill for running a single query with no other users on the system. Keep in mind, though, that in the real world you’ll likely have multiple folks running concurrent queries at any given time, so your mileage may vary.

Figure 14 - Small Cluster Performance Results

An Extra-Small got me there in 19 seconds, with some of that being the longer-than-usual connection time. 

Figure 15 - X-Small Cluster Performance Results

Going first-class with a Large did the job in 10s, so I wouldn’t spend the extra cash for 1s better response time. I didn’t bother playing with the X-Large and 2X-Large clusters. 

Figure 16 - Large Cluster Performance Results

The moral of the story here is YOU get to decide how many resources you want to apply to a problem based on how quickly you want your answer. Don’t need super-fast response time and have basic queries? Lean on Free or Small-ish clusters. Save your pennies. Bring out the big guns only when you need them.

Evil Sith Sorcery!

Sorcery, perhaps. But not evil. I peg Galaxy as Neutral Good. And our Mana is incredibly high.

 

Russell Christopher

Russell is the Director of Product Strategy at Starburst and really, really likes data.
