Part 6: How Hadooped is HANA?

Now for HANA plus Hadoop… to continue this thread on RDBMS-Hadoop integration (Part 1, Part 2, Part 3, Part 4, Part 5) I have suggested that we could evaluate integration architecture using three criteria:

  1. How parallel are the pipes to move data between the RDBMS and the parallel file system;
  2. Is there intelligence to push down predicates; and
  3. Is there more intelligence to push down joins and other relational operators?

As a preface I need to include a note/apology. As you will see HANA may well have the best RDBMS-Hadoop integration in the market. I try hard not to blow foam about HANA in this blog… and I hope that the objective criteria I have devised to evaluate all of the products will keep this post credible… but please look at this post harder than most and push back if you think that I overstep.

First… surprisingly, HANA’s first release has only a single pipe to the Hadoop side. This is worrisome but easily fixed. It will negatively impact performance when large tables/files have to be moved up for processing.

But HANA includes Hadoop as a full partner in a federated data architecture using the Smart Data Access (SDA) engine inside the HANA address space. As a result, HANA not only pushes predicates but it uses cost-based optimization to determine what to push down and what to pull up. HANA interrogates the Hadoop system to gather statistics and uses the HANA optimizer to develop smart execution plans with awareness of both the speed of in-memory and the limited memory resources. When data in HANA is joined with data in Hadoop SDA effectively uses semi-joins to minimize the data pulled up.

Finally, HANA can develop execution plans that executes joins in Hadoop. This includes both joins between two Hadoop tables and joins where small in-memory tables are pushed down to execute the joins in Hadoop. The current limitation is that Hadoop files must be defined as Hive tables.

Here is the HANA execution plan for TPC-H query 19. HANA has pushed down all of the steps behind the Remote Row Scan step… so in this case the entire query including a nested loop join was pushed down. In other queries HANA will push only parts of the plan to Hadoop.

TPCH Q19 Plan

So HANA possesses a very sophisticated integration with Hadoop… with capabilities that minimize the amount of data moved based on the cost of the movement. This is where all products need to go. But without parallel pipes this sophisticated capability provides only a moderate advantage (see Part 5),

Note that this is not the ultimate in integration… there is another level… but I’ll leave some ideas for extending integration even further for my final post in the series.

Next… Part 7… considering Greenplum…

Part 5: A Review of Processing Push-down

Continuing this thread on RDBMS-Hadoop integration (Part 1, Part 2, Part 3, Part 4) I have suggested that we could evaluate integration architecture using three criteria:

  1. How parallel are the pipes to move data between the RDBMS and the parallel file system;
  2. Is there intelligence to push down predicates; and
  3. Is there more intelligence to push down joins and other relational operators?

I want to be sure that I’ve conveyed the concepts behind these criteria properly… I may have rushed it in the early parts of this series.

Let’s imagine a query that joins a 2,000,000 row table with a 1000 row dimension table where both live in HDFS.

If all of the data has to be moved from HDFS to the RDBMS then 2,001,000  rows must be read and moved in order to apply a predicate or any other processing.. For fun lets say that the cost of moving this data is 2001K.

If there are 10 parallel pipes then the data movement is completed in one tenth the time… so the cost is 200K.

If a predicate is included that selects only 5% of the data from the big table, and the predicate is pushed down the cost is reduced to 101K. Add in parallel pipes and the cost is 10K

Imagine a query where there is a join between the two tables with predicates on one side and predicate push down… then you have to pay 101K to pull the projected data up and do the join in the RDBMS. If there is a join predicate that reduces the final answer set by another 95% then after the join you return 6K rows. Since everybody returns the same 6K rows as an answer we won’t add that in.

But if you can push the join down as well as the predicates then only 6K rows are moved up… so you can see how 2001K shrinks to 6K through the effective push down of processing.

Further, you can build arbitrarily complex queries and model them pretty well knowing that most of the cost is in data movement.

So think about how Teradata processes these two tables in Hadoop when you use the specialized SQL constructs and then again if you build the query from a BI tool. And stay tuned as I’ll show you how HANA processes the data next…. and then talk about several others.

On to Part 6

Part 4: How Hadooped is Teradata?

In this thread on RDBMS-Hadoop integration (Part 1, Part 2, Part 3) I have suggested that we could evaluate integration architecture using three criteria:

  1. How parallel are the pipes to move data between the RDBMS and the parallel file system;
  2. Is there intelligence to push down predicates; and
  3. Is there more intelligence to push down joins and other relational operators?

Let’s consider the Teradata SQL-H implementation using these criteria.

First, Teradata has effective parallel pipes to move data from HDFS to the Teradata database with one pipe per node. There does not seem to be any inter-node IO parallelism. This is a solid feature.

There is a limited ability to push down predicates… SQL-H does allow data to be partitioned on the HDFS side and it will perform partition elimination if the query explicitly calls out a predicate within a partionfilter() keyword. In addition there is an ability to project out columns using a columns() keyword to explicitly specify the columns to be returned. These features are klunky but effective. You would expect partitions to be eliminated when the partitioning column is referenced with a predicate in the query like any other query… and you would expect columns to be projected out if they are not referenced. Normal SQL predicates are applied after the data is moved over the network but before every record is written into the Teradata database.

Finally SQL-H provides no advanced capabilities to push down join operators or other functions.

The bottom line: SQL-H is a sort of klunky implementation, requiring non-ANSI-standard and non-Teradata standard SQL syntax. Predicate push down is limited but better than nothing. As you will see when we review other products, SQL-H is a  basic offering. The lack of full predicate push-down and advanced features will negatively and severely impact performance when accessing large volumes of data, Big Data, and the special SQL syntax will limit the ability to access HDFS data from 3rd party tools. This performance penalty will force customers to pre-join and pre-aggregate data in Hadoop rather than access it naturally.

Next Part 5...

References

Teradata Magazine: Hands On Dynamic Access

Doug Frazier: SQL-H Presentation