June 2014 – Database Fog Blog

In January Greenplum rolled out a new query optimizer. This is very cool and very advanced stuff.

Query optimization is a search problem… in a perfect world you would search through the space of all possible plans for any query and choose the least expensive plan. But iterating through all possible plans would take longer than running most queries… so optimizers use rules to cut down the space searched. The rules have been built up over the years and are designed to prune the space quickly to keep performance high for simple queries. But these rules can break down when complex queries are introduced… so Greenplum made the significant investment to build a new optimizer from scratch.
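
To make the search-versus-rules trade-off concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the table names and row counts, the flat one-in-a-thousand join selectivity, the toy cost function, and the "smallest table first" rule. It is not how Greenplum or any real optimizer costs plans. It only shows that an exhaustive search costs every join order while a rule jumps to a single candidate.

```python
from itertools import permutations

# Hypothetical table sizes in rows; not taken from any real system.
TABLES = {"a": 1_000_000, "b": 50_000, "c": 2_000, "d": 100}
SELECTIVITY_DENOM = 1000  # assumed: 1 in 1000 row pairs survives each join


def plan_cost(order):
    """Toy cost of a left-deep plan: the sum of its intermediate result sizes."""
    rows = TABLES[order[0]]
    cost = 0
    for name in order[1:]:
        rows = rows * TABLES[name] // SELECTIVITY_DENOM
        cost += rows
    return cost


def exhaustive_search():
    """Cost every left-deep join order; return (best_order, plans_examined)."""
    plans = list(permutations(TABLES))
    return min(plans, key=plan_cost), len(plans)


def greedy_search():
    """Pruning rule: join tables smallest first; only one plan is considered."""
    order = tuple(sorted(TABLES, key=TABLES.get))
    return order, 1


if __name__ == "__main__":
    best, examined = exhaustive_search()
    print("exhaustive:", best, plan_cost(best), "plans examined:", examined)
    rule, examined = greedy_search()
    print("rule-based:", rule, plan_cost(rule), "plans examined:", examined)
```

With four tables the exhaustive search costs 4! = 24 orders, and the count grows factorially from there. In this toy model the rule happens to land on a cheapest plan… the model is too simple to show a rule breaking down the way it can on a complex query.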

Florian Waas, the leader of this program for Greenplum (now off on another venture), explained it to me this way. If the large rectangle in Figure 1 represents the total search space for a query, a modern query optimizer only searches the area in the small gray square… it looks for the best plan in that small space.

You may be surprised to learn that the optimizers used by every major DBMS product are single-threaded… they use only one core of a multi-core processor to search the space and produce a plan. A faster single processor does not effectively let you search more (you could search more plans, but query execution would speed up as well, so the time spent optimizing as a percentage of query execution time would stay the same)… so if the optimizer is to search more of the space it will have to use multiple cores and search the space in parallel… and this is exactly what Greenplum has accomplished.
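
The idea of searching the space in parallel can be sketched as follows. This is not ORCA's algorithm… it is a minimal illustration, under assumed table sizes and a toy cost function, of splitting the plan space into partitions (here, by the first table in the join order), searching each partition in its own worker, and keeping the cheapest result. The names `search_partition` and `parallel_search` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import permutations

# Hypothetical table sizes in rows; not taken from any real system.
TABLES = {"a": 1_000_000, "b": 50_000, "c": 2_000, "d": 100}


def plan_cost(order):
    """Toy cost: sum of intermediate result sizes, assuming 1-in-1000 selectivity."""
    rows, cost = TABLES[order[0]], 0
    for name in order[1:]:
        rows = rows * TABLES[name] // 1000
        cost += rows
    return cost


def search_partition(first):
    """Search only the plans that start with `first`; return (cost, order)."""
    rest = [t for t in TABLES if t != first]
    candidates = ((first, *tail) for tail in permutations(rest))
    return min((plan_cost(p), p) for p in candidates)


def parallel_search(workers=4):
    """Give each partition of the plan space to a worker, then merge the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partition_bests = list(pool.map(search_partition, TABLES))
    return min(partition_bests)


if __name__ == "__main__":
    cost, order = parallel_search()
    print("best plan:", order, "cost:", cost)
```

Threads keep the sketch short, but in standard CPython they will not spread CPU-bound work like this across cores… a process pool would be the closer analogue to an optimizer using multiple cores.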

The benchmark results for this are impressive (see here)… several queries in the TPC-DS suite run hundreds of times faster.

ORCA, the new optimizer, is available to early support customers now and the results map to the benchmark… some queries see an extreme performance boost, while others run significantly slower. This is to be expected from any first-release optimizer.

But Greenplum has built another advanced technology into ORCA to reduce the time it will take to mature the software. ORCA includes AMPERe, an optimizer debugging facility that captures the state necessary to recreate problems and fix them. Together these capabilities, parallel search and specialized debugging, have advanced the state of the art significantly.

What does it mean to you? It will take some time to shake out ORCA… and HAWQ is still very slow when compared to other analytic databases… and very, very slow when compared to the in-memory databases available… and in-memory products like Spark are coming to the Hadoop ecosystem. But at the price point HAWQ is a bargain. If you need an inexpensive batch engine that crunches numbers offline, then in the next year, as ORCA matures, it may be worth a look.

As a side note… this topic introduces one of the issues related to in-memory databases… when even a very complex query completes with a sub-optimal plan in under a second, how much time can you spend searching the plan space? I suspect that applying the parallel optimization principles developed by the Greenplum team will yield similar or even better improvements for in-memory… and these techniques will be a requirement very soon in that space.

References

PivotalGuru: Greenplum Database 4.3
Theory in Practice: How to Build a Query Optimizer for Big Data
SIGMOD 2009: Parallelizing Extensible Query Optimizers

This post will consider the implications of a full database federation as would be required by a Logical Data Warehouse. I’ll build on the concepts introduced in the posts on RDBMS-Hadoop integration (Part 1, Part 2, Part 3, Part 4, Part 5, Part 6, Part 7, Part 8).

Figure 1 summarizes those earlier concepts from simple to advanced.

But the full federation required to implement a logical data warehouse requires a significant step up from this. Simple federation will be a disaster and Basic federation will not be much better. Here is why.

Let’s add a database and use Figure 2 to consider the possibilities when we submit a query that joins Table A.One to A.Two to B.One to C.One. Note that in this picture we have included a Governor to execute the federated queries that is independent of any of the DBMSs… this is the usual case for federation.

In the simple case, where the Governor executes the entire plan, all of the data must come to the Governor. This is clearly unacceptable. Consider the worst case, where a SELECT is issued against only one table… still all of the data must bubble up.

In the Basic case the problem is partially mitigated… less data moves after the predicates are resolved but the overhead will still kill query performance. A Governor with basic capabilities provides the minimal features to make this work. It is useful where slow federation is better than data replication… but that is about all.
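
The difference between the simple and basic cases comes down to how many rows cross the network. Here is a minimal sketch, assuming a made-up remote table of 10,000 rows where 1% match the predicate… the function names and the data are hypothetical and do not model any particular product.

```python
# Hypothetical remote table held by one database: 10,000 rows, 1% in region "EU".
REMOTE_ROWS = [
    {"id": i, "region": "EU" if i % 100 == 0 else "US"} for i in range(10_000)
]


def simple_federation(predicate):
    """Simple case: the Governor pulls every row, then filters locally."""
    moved = list(REMOTE_ROWS)  # all rows bubble up to the Governor
    result = [row for row in moved if predicate(row)]
    return result, len(moved)


def basic_federation(predicate):
    """Basic case: the predicate is resolved at the source; only matches move."""
    moved = [row for row in REMOTE_ROWS if predicate(row)]
    return moved, len(moved)


if __name__ == "__main__":
    def in_eu(row):
        return row["region"] == "EU"

    _, rows_simple = simple_federation(in_eu)
    _, rows_basic = basic_federation(in_eu)
    print("rows moved (simple):", rows_simple)
    print("rows moved (basic):", rows_basic)
```

Both functions return the same answer… only the volume of data moved differs. The sketch covers a single-table SELECT; once a join spans databases, the rows that survive the predicates still have to travel, which is the overhead described above.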

However, the advanced case becomes seriously more complicated. The optimizer now has to decide whether table B.One should move to C to join the data, or move to A… or whether the data should move to the Governor.

The problem is further complicated by any resource shortage on any node and by any differences in functional capability. If the cost of data movement would suggest moving B data to C… but there is no CPU resource available on C, then maybe a different decision should be made. If C.One is a big table but C is a column-store, and the cost of the SELECT is small because only a few columns are required and the cardinality of those columns is low enough that the data might be fetched from the dictionary, then we might make a different decision. If B is a fast in-memory database but there is no memory available, then the cost changes. Finally, if there are twenty databases in your logical DW then the problem grows exponentially.
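
Here is a rough sketch of how a resource shortage can flip the decision. The cost formula (rows shipped divided by the fraction of free CPU at the target), the row counts, and the `Site`, `join_cost`, and `choose_site` names are all assumptions made up for illustration… a real federated optimizer would weigh far more than this.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    rows: int        # rows this site would have to ship if the join runs elsewhere
    cpu_free: float  # assumed fraction of CPU currently available, 0.0 to 1.0


def join_cost(target, sites):
    """Toy cost: rows shipped to `target`, inflated when the target is short on CPU."""
    if target.cpu_free <= 0.0:
        return float("inf")  # no capacity at all: this placement is not viable
    shipped = sum(s.rows for s in sites if s.name != target.name)
    return shipped / target.cpu_free


def choose_site(sites):
    """Pick the site where running the join is cheapest under the toy model."""
    return min(sites, key=lambda target: join_cost(target, sites))


if __name__ == "__main__":
    for c_cpu in (1.0, 0.01):
        sites = [
            Site("B", 1_000_000, 0.8),
            Site("C", 50_000_000, c_cpu),
            Site("Governor", 0, 1.0),
        ]
        print("C cpu_free =", c_cpu, "-> join at", choose_site(sites).name)
```

When C is idle the small table moves to C. When C has almost no CPU to spare, shipping everything to the Governor becomes the cheaper choice in this model. And this is for one join… every additional join and every additional database multiplies the number of placements that have to be costed.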

The point here is clear… data federation over n tiers is a hard problem. There will be severe performance issues when the optimizer picks wrong. This is why the independent governor model is so attractive… Many of the variables around CPU resources and database capabilities are removed… and while the performance will be poor it will be predictably poor. You should consider the implications carefully… it is just not clear that a high-performance logical data warehouse is feasible simply laid over an existing architecture. And if you build on a model with a Governor you must be sure that the Governor and the Provincial databases can handle the load. I suspect that the Governor will have to run on a cluster and use a shared-nothing architecture to handle a true enterprise-sized logical EDW.

HANA has a twist on this that is interesting. The Governor lives inside one of the database nodes… so for data in HANA there is no data movement cost unless the optimizer decides to send the data to another node. Further, HANA is very fast… and the performance will mitigate some of the slowness inherent in federation. Finally, HANA is a shared-nothing DBMS… so it is not a problem to move lots of data to HANA in support of big tables and/or thousands of concurrent queries.

I’ll try to use this thinking (simple, basic, and advanced federation over some governed federator on an in-memory or fast shared-nothing architecture) to evaluate the products on the market that provide federation. This may prove interesting as the Logical Data Warehouse concept catches on and as products like Teradata’s QueryGrid come to market.

The Greenplum ORCA Optimizer

Logical Data Warehouses and the Basics of Database Federation
