Now for HANA plus Hadoop… to continue this thread on RDBMS-Hadoop integration (Part 1, Part 2, Part 3, Part 4, Part 5), I have suggested that we could evaluate integration architecture using three criteria:
How parallel are the pipes to move data between the RDBMS and the parallel file system;
Is there intelligence to push down predicates; and
Is there more intelligence to push down joins and other relational operators?
As a preface I need to include a note/apology. As you will see HANA may well have the best RDBMS-Hadoop integration in the market. I try hard not to blow foam about HANA in this blog… and I hope that the objective criteria I have devised to evaluate all of the products will keep this post credible… but please look at this post harder than most and push back if you think that I overstep.
First… surprisingly, HANA’s first release has only a single pipe to the Hadoop side. This is worrisome but easily fixed. It will negatively impact performance when large tables/files have to be moved up for processing.
But HANA includes Hadoop as a full partner in a federated data architecture using the Smart Data Access (SDA) engine inside the HANA address space. As a result, HANA not only pushes predicates but uses cost-based optimization to determine what to push down and what to pull up. HANA interrogates the Hadoop system to gather statistics and uses the HANA optimizer to develop smart execution plans with awareness of both the speed of in-memory processing and the limited memory resources. When data in HANA is joined with data in Hadoop, SDA effectively uses semi-joins to minimize the data pulled up.
Finally, HANA can develop execution plans that execute joins in Hadoop. This includes both joins between two Hadoop tables and joins where small in-memory tables are pushed down to execute the joins in Hadoop. The current limitation is that Hadoop files must be defined as Hive tables.
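To make the cost-based choice concrete, here is a minimal sketch of the kind of decision such an optimizer might make… note that the function, the constants, and the selectivities below are my own illustrative assumptions, not HANA internals:

```python
# Illustrative only: a toy cost-based decision between pulling filtered
# Hadoop rows up to the RDBMS and pushing a small in-memory table down.
# All names and numbers are assumptions for exposition, not HANA internals.

def choose_join_site(hana_rows, hadoop_rows, predicate_selectivity, join_selectivity):
    """Pick the plan that moves the least data across the pipe."""
    filtered = hadoop_rows * predicate_selectivity  # rows surviving the pushed predicate
    result = filtered * join_selectivity            # rows surviving the join
    cost_pull_up = filtered              # move filtered Hadoop rows up; join in HANA
    cost_push_down = hana_rows + result  # ship the small table down; return only results
    return "join in HANA" if cost_pull_up <= cost_push_down else "join in Hadoop"

print(choose_join_site(1_000, 2_000_000, 0.05, 0.05))
# -> "join in Hadoop": shipping 1K rows down and 5K results up beats pulling 100K up
```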
Here is the HANA execution plan for TPC-H query 19. HANA has pushed down all of the steps behind the Remote Row Scan step… so in this case the entire query including a nested loop join was pushed down. In other queries HANA will push only parts of the plan to Hadoop.
So HANA possesses a very sophisticated integration with Hadoop… with capabilities that minimize the amount of data moved based on the cost of the movement. This is where all products need to go. But without parallel pipes this sophisticated capability provides only a moderate advantage (see Part 5).
Note that this is not the ultimate in integration… there is another level… but I’ll leave some ideas for extending integration even further for my final post in the series.
Continuing this thread on RDBMS-Hadoop integration (Part 1, Part 2, Part 3, Part 4) I have suggested that we could evaluate integration architecture using three criteria:
How parallel are the pipes to move data between the RDBMS and the parallel file system;
Is there intelligence to push down predicates; and
Is there more intelligence to push down joins and other relational operators?
I want to be sure that I’ve conveyed the concepts behind these criteria properly… I may have rushed it in the early parts of this series.
Let’s imagine a query that joins a 2,000,000-row table with a 1,000-row dimension table where both live in HDFS.
If all of the data has to be moved from HDFS to the RDBMS then 2,001,000 rows must be read and moved in order to apply a predicate or any other processing. For fun, let’s say that the cost of moving this data is 2001K.
If there are 10 parallel pipes then the data movement is completed in one tenth the time… so the cost is 200K.
If a predicate is included that selects only 5% of the data from the big table, and the predicate is pushed down, the cost is reduced to 101K. Add in parallel pipes and the cost is 10K.
Imagine a query where there is a join between the two tables with predicates on one side and predicate push down… then you have to pay 101K to pull the projected data up and do the join in the RDBMS. If there is a join predicate that reduces the final answer set by another 95% then after the join you return 6K rows. Since everybody returns the same 6K rows as an answer we won’t add that in.
But if you can push the join down as well as the predicates then only 6K rows are moved up… so you can see how 2001K shrinks to 6K through the effective push down of processing.
Further, you can build arbitrarily complex queries and model them pretty well knowing that most of the cost is in data movement.
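If it helps, the arithmetic above can be captured in a few lines. This is just the toy model restated as code… the function name and the convention of counting cost in thousands of rows moved are mine:

```python
# The data-movement cost model above, restated. Cost is thousands of rows
# moved between Hadoop and the RDBMS; the names are illustrative.

FACT, DIM = 2_000_000, 1_000

def cost_k(rows_moved, pipes=1):
    return rows_moved / pipes / 1_000

print(cost_k(FACT + DIM))                   # move everything, one pipe -> 2001.0
print(cost_k(FACT + DIM, pipes=10))         # ten parallel pipes        ->  200.1
print(cost_k(FACT * 0.05 + DIM))            # 5% predicate pushed down  ->  101.0
print(cost_k(FACT * 0.05 + DIM, pipes=10))  # predicates plus pipes     ->   10.1
print(cost_k(FACT * 0.05 * 0.05 + DIM))     # join pushed down too      ->    6.0
```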
So think about how Teradata processes these two tables in Hadoop when you use the specialized SQL constructs and then again if you build the query from a BI tool. And stay tuned as I’ll show you how HANA processes the data next…. and then talk about several others.
In the thread I’m now working on (Part 1, Part 2) I’ve talked about Exadata and criteria for evaluating split architectures. It is worth quickly talking about homogeneous systems with no split, like Teradata or Greenplum or HANA… systems with no separate query-capable top end.
Every parallel shared-nothing DBMS has big pipes to move data between nodes… and they all have the advanced intelligence to reduce the amount of data moved between nodes within a query.
So if we compare Exadata to a split system with an RDBMS and Hadoop it fares very well. But if we compare Exadata to Teradata or Greenplum or HANA… then the bottlenecks in Exadata look more severe. Exadata may tie, or occasionally win, in a POC against these homogeneous competitors when the queries fit their architectural sweet spot. But if any queries are included that expose the bottlenecks or the limitations in query push-down… Exadata’s weak split architecture shows.
In my blog yesterday (Part 1) I suggested that we could evaluate RDBMS-Hadoop integration architecture using three criteria:
How parallel are the pipes to move data between the RDBMS and the parallel file system;
Is there intelligence to push down predicates; and
Is there more intelligence to push down joins and other relational operators?
But Exadata is a split RDBMS with a parallel file system backing it… how does it measure up by these criteria?
There are effective parallel pipes between the Oracle RAC RDBMS and the Exadata Storage Subsystem… so Exadata passes the first test. Further, Exadata is smart about pushing both scan and projection down to the Storage layer.
Unfortunately there is a fairly severe imbalance between the number of nodes on the RAC side and the number of nodes on the Storage side, and this creates a bottleneck. We cannot give Exadata full marks here… but as far as parallel pipes go it stacks up pretty well.
The ability to push down predicates goes a long way towards solving this as the predicate push-down reduces the amount of data that has to move over the bottleneck. But in every data warehouse there will be queries that return lots of rows from the early execution steps… and Exadata cannot join data in the Storage Subsystem so it tries to pull data up sparingly and push down semi-joins whenever possible… it just cannot be done in every case (Note: in Exadata POCs Oracle will try to ensure that no queries are included that pull lots of data up to the RAC layer… and competitors will try to include queries that expose this weakness…).
So… Oracle also includes some intelligence to push some data down to reduce data movement. There is no way to choose to move data from the RAC layer to the Storage Subsystem and execute the query there… the Storage Subsystem can only scan and project… so again we cannot give Exadata full marks… but it is pretty smart as you will see when we start looking at alternative implementations.
Finally, Exadata cannot effectively split a single query plan across both layers… so no marks at all here.
So Exadata is pretty good… but it has weak spots that will be severe for an important set of DW queries in any implementation.
The next few blogs will try to evaluate the different approaches to integrating Hadoop and a standard RDBMS… so the first thing I’ll try in this post is to suggest criteria, based on some architectural choices, for making the evaluation. Further, I’ll inject a little surprise and make the point by using the criteria to say something about a product that is not an integration of an RDBMS and Hadoop.
For the purposes of this discussion let me be clear that by “Hadoop” I mean at least HDFS plus MapReduce… so I will discuss integrating a parallel RDBMS with data stored in HDFS: a massively parallel file system with a programming capability included. By “integration” I mean that the full set of SQL supported by the RDBMS must be available for processing queries that refer to data across the Hadoop-RDBMS divide.
Since we’ve assumed that all SQL functionality is supported, the architectural issue left to solve is performance, and this issue revolves around one topic: how do we minimize the cost of moving data between the two partners for a given query?
Now to get on with it…
The easiest, but not all that easy, problem involves using parallelism to move data from one system to the other… so the first criterion we will evaluate for each product considers how parallel its movement of data is.
The next criterion involves intelligence in the RDBMS to push down some execution operators to the data layer. Of course the RDBMS must scan remote data… so in this part of the evaluation we will grade each product’s ability to push processing down to apply predicates and project the minimal amount of data up to the RDBMS.
Finally, a most intelligent product would push more than just predicates down… it would push down joins and aggregation… and the decisions around splitting processing would be fully optimized. A most intelligent product would fully federate the HDFS data into the RDBMS.
So there you have it… I will start evaluating RDBMS-Hadoop architecture by three criteria:
how parallel is the data movement between the RDBMS and Hadoop;
is there intelligence to minimize data movement by pushing the least data and the associated query plan to one system or another… this requires parallel pipes in both directions; and
is there intelligence to build an optimal query plan that splits steps across both systems to completely minimize the movement of data and/or optimize the compute.
And a final word on the relative strength of each criterion:
If we imagine a 10-node Hadoop cluster talking to a 10-node RDBMS with 10 parallel pipes and compare it to the same setup with only 1 pipe (not parallel), then we might suggest that the parallel pipes provide a 10X performance increase.
If we imagine intelligence that moved 100K rows rather than 10M then we might suggest that intelligent push down might provide a 100X performance increase…
If we had even more intelligence and further optimized processing then another 10X-100X might be possible.
So the three criteria are not equal… intelligent query planning trumps wide pipes…
Now for the surprise… in the next blog we’ll look at how Exadata’s architecture maps to these criteria… since it is a two-tiered architecture with an RDBMS tied to a parallel file system…
[Image: “That’s not a knife… THAT’S A KNIFE” – C. Dundee (Photo credit: Wikipedia)]
Michael Stonebraker has suggested several times… and again in this interview… that databases will become more specialized and that “one size will fit none”. I’m sure that his argument is more nuanced than the sound bites in the interview, but in this post I’ll suggest a line of thinking that may lead to a different conclusion.
First, let’s agree that the word “fit” means the best price for the performance required to meet your company’s service level requirements.
Then let’s agree with the basic premise behind Dr. Stonebraker’s argument… we agree that in any single-purpose application a specialized single-purpose DBMS can be developed that will outperform a generalized DBMS. This means that one-half of the fit, performance, is likely.
We would also agree that, between the growth of open source databases and the general growth in the database space, it is likely that someone can and will develop specialized databases and bring them to market in cases where there is enough market to make it worthwhile. But it is important to note that specialization will not become infinitely narrow… there has to be enough market for the special case to generate an attractive product.
So where is the disagreement? I do not believe that data is ever used in a single specialized business context. Not ever.
Let’s imagine that we have a business requirement for an extremely high volume OLTP application… and let us assume that the performance and/or scalability requirements are beyond what any general DBMS can provide and that the business ROI is significant… in other words, let us imagine that we are Google or Facebook. In this case we have no choice but to select or to develop an extreme, specialized, DBMS to solve the problem and extract the return.
But in this case, once the OLTP transactions are recorded… what do we do with the data? We need to use the data elsewhere in the business as the basis for deep analytics or for basic business intelligence… so we have to replicate the data to a second database. Since the second DBMS is also sort of specialized… it does not have to support OLTP… we select a second specialized, data warehousish product.
And then come new requirements for doing light queries in near real time to support operational analytics… so we build some sort of operational data store. Again we can select a product with a narrow technical sweet spot… but we have to replicate the data a third time.
In other words… given my premise… that data is never used in a single specialized context… specialized databases force replication… and replication allows for further specialization.
But what if the requirements are not so extreme? Then we might use a single conventional RDBMS for the EDW and for the ODS problems. If a more generalized DBMS product exists that could handle both the operational reporting and the analytic reporting requirements in a single image of the data then we could eliminate one replica and one DBMS. In other words, if the problem is not so extreme then a generalized solution might provide a solution in a single instance of the data avoiding replication.
Now the issue becomes: is the cost of a specialized system plus replication plus a second specialized system plus the cost of operating these systems less than the cost of a single generalized system? I believe that the answer will often be in favor of a single system even when the specialized systems are low-cost open source. Since “fit” is about cost maybe one size does fit now and again.
This suggests the strategy of the OldSQL vendors. They are offering a Swiss Army Knife product that serves multiple requirements. Their feature sets have grown over 30 years and they are pretty capable across a wide array of business problems… and with the columnar and in-memory features being added they continue to cover ever more extreme use cases… not the most extreme use cases… but they cover more ground each year.
The strategy of the NewSQL vendors is to focus tight and hard. They might develop an extreme OLTP DBMS with no ability to do a join… a product with extreme scalability and no performance… or a graph database to solve for an important, narrow, set of queries… or a columnar product that performs analytics but supports no OLTP. This trend feeds the specialize-and-replicate meme advocated by Dr. Stonebraker.
HANA is a horse of a different color… neither NewSQL nor OldSQL. It is a new code base designed to solve for a very wide set of use cases in a single instance of the data. We certainly agree in this blog with Dr. Stonebraker’s contention that the 30-year-old legacy code base has to be retired. But SAP contends that you can build a new, generalized, DBMS that solves for all but the most extreme cases.
This is a great spot to end the year… having laid out the battle we will cover in this blog going forward… with the legacy OldSQL vendors trying to tack on to their legacy code base… and doing pretty well at it… with the NewSQL vendors trying to specialize and replicate… and with HANA offering a new code base designed to solve for the whole picture. 2014 will be great fun to watch. This also sets the stage to ask next year whether “big data” applications are so extreme as to force users to specialize-replicate-specialize.
Have a great holiday season… and my best wishes. Thank you all for reading the Database Fog Blog in 2013… I hope for your continued attention in the New Year…
In this blog I have stated explicitly and implied now and again that the big architectural features are what count… despite the fact that little features are often what are marketed. Here is a true story to reinforce this theme… and a reminder of the implications… a real-life battle between two vendors: we’ll call them NewVendor and LegacyVendor.
Four years ago, more or less, NewVendor sold a system to offload work from an existing LegacyVendor configuration. Winning the business was tough and the POC was a knife-fight. At that time the two vendors were architecturally similar with no major advantages on either side. In the end NewVendor won a fixed contract that provided 16 nodes and guaranteed to match the performance of LegacyVendor for a specific set of queries. The 16 node configuration was sized based on the uncompressed data in LegacyVendor’s system.
NewVendor sent in a team to migrate the data and the queries… what was expected to be a short project. But after repeated attempts and some outside effort by experts the queries were running 50% slower than the target. I was asked to have a look and could see no glaring mistakes that could account for such a large performance miss… I saw no obvious big tuning opportunities.
After a day or so of investigation I found the problem. LegacyVendor offered a nice dictionary-based compression scheme that shrunk the size of the data by… you guessed it… exactly 50%. Because the NewVendor solution had to read 50% more data with each query they were 50% slower. I recommended that NewVendor needed to supply eight more nodes to hit the performance targets.
In the course of making these recommendations I was screamed at, literally, by one technology executive and told that I was a failure by another. They refused to see the obvious, exact, connection between the performance and the compression. I was quickly replaced by another expert who spent four months on site tuning and tuning. He squeezed every last drop out of the database going so far as to reorder columns in every table to squeeze out gas where data did not align on word boundaries. His expert tuning managed to reduce the gap and in the end NewVendor purchased three fewer nodes than my recommendation. With those nodes they then hit the targets. But the cost of his time and expense (he was a contractor) exceeded the cost of the nodes he saved… and the extra four-month delay antagonized the customer such that the relationship never recovered.
In a world where the basics of query optimization and execution are known to all, only big-ticket items differentiate products. When all of the big-ticket, architectural, capabilities are the same, the difference between any two mature RDBMS products will rarely be more than 10%-15% across a large set of queries. The big-ticket differentiators today are the application of parallelism, compression and column store, and I/O avoidance (i.e. in-memory techniques). The answer to the question of who outperforms whom can be found, to a close approximation, by looking at who is how parallel (here) and who is how columnar (here) and merging the two… with a dose of who best avoids I/O through effective use of memory. This is the first lesson of my story.
The second lesson is… throw hardware at tuning problems when there are no giant architectural mistakes. Even a fat server node costs around $15K… and you will be better off with faster hardware than with a warehouse or mart that is so finely tuned and fragile that the next change to the schema or the data volumes or the workload breaks it.
Epilogue
Soon after this episode NewVendor rolled out a Level 2 columnar feature. This provided them with a distinct advantage over LegacyVendor… an advantage almost exactly equal to their advantage in compression plus the advantage from columnar projection to reduce I/O… and for several years they did not lose a performance battle to LegacyVendor. Today LegacyVendor has a comparable capability and the knife-fight is on again… Architecture counts…
My friend and a major contributor to the art of database architecture, Chuck McDevitt, died last week. Five years ago Chuck was diagnosed with an advanced cancer and given six months to live. He fought and endured and worked for most of those five years, teaching us all a little about how we might live our lives.
Chuck’s contributions to database architecture are not so well known. He was employee number fourteen at Teradata and developed, with Dan Holle, the Teradata version of SQL. Chuck invented several foundational parts of any parallel database system. He left Teradata and went to Cogit, a start-up that developed a very early parallel data mining tool. From there Chuck went to Ab Initio as a senior architect, and from there he went to Greenplum where he was the Chief Architect.
At Greenplum Chuck was the brains behind the development of their parallel version of Postgres. It is significant to note that Chuck’s architectural insights led to an extensible and powerful implementation that far exceeded the efforts of others trying to accomplish the same result from a Postgres starting point: Aster Data, Netezza, DataAllegro, and the Postgres community.
To convey Chuck’s contribution to Teradata let me tell a story.
After E.F. Codd published his ground-breaking papers on relational theory, two research projects began to develop relational databases. The UC Berkeley project developed a query language called QUEL, which was followed by an IBM Research project and a query language called Sequel. As Teradata entered the scene they developed a third query language, TEQUEL, selecting the best features of each. IBM then delivered DB2 and SQL/DS, and the Oracle Database appeared, all based on Sequel, which was by then called SQL. QUEL was forgotten as SQL dominated the market, leading to its adoption as an ANSI standard in 1986.
One Friday in 1985, the Teradata sales management convened a meeting to request that Teradata support SQL in order to effectively compete in the emerging RDBMS market. They were told by Product Management that it would take three years to develop SQL support… that it was out of the question. Further, the Teradata CTO argued that TEQUEL was superior to SQL and that this was an advantage in any case.
Dan Holle heard about the meeting, called Chuck in and, convinced that the Teradata management team was headed in the wrong direction, started to work. What the technical execs must have known, but clearly did not appreciate, was that Dan and Chuck had developed the Teradata database using what was, at the time, a very advanced concept… compiling the TEQUEL query language into an intermediate language and then processing that. Working non-stop through the weekend Chuck developed an SQL parser that would generate the proper intermediate code for consumption by the optimizer and execution engine… and on Monday morning he demonstrated a functioning SQL version of Teradata.
It is fair to say that, but for Chuck McDevitt, Teradata would not exist… and it is likely that Greenplum would not exist.
As I repeatedly suggest in this blog… architecture counts… and you should all know that the database community lost a great architect last week. Chuck will be missed.
[Image: Column base in Trajan’s Forum in Rome. (Photo credit: Wikipedia)]
This post will tie up some loose ends around columnar processing and consider some other players I left out. Part 1 and Part 2 of this series may be found here and here, respectively.
Other DBMS
Sybase IQ
How could I forget Sybase IQ? IQ is a mature Level 3 column store with vector optimizations… and several other advanced index optimizations.
Infobright

Infobright is a Level 1 column store. That is, they use column orientation to achieve compression but then store multiple columns in a data pack. The cool thing they do is capture detailed statistics for each data pack, which allows them to answer some queries from the stats alone… and allows them to retrieve and decompress only the packs that seem relevant to the query. It is sort of like IBM’s skip processing… only with more intelligent skipping.
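As a sketch of how answering from pack statistics might look (the pack size, layout, and names here are my assumptions, not Infobright’s actual format):

```python
# A minimal sketch of data-pack statistics and skip processing in the spirit
# of the description above; pack size and structure are my assumptions.

PACK_SIZE = 65_536

def build_packs(column):
    """Split a column into packs, keeping min/max/count stats for each."""
    packs = []
    for i in range(0, len(column), PACK_SIZE):
        rows = column[i:i + PACK_SIZE]
        packs.append({"min": min(rows), "max": max(rows),
                      "count": len(rows), "rows": rows})
    return packs

def count_where_ge(packs, threshold):
    """COUNT(*) WHERE col >= threshold, touching as few packs as possible."""
    total = 0
    for p in packs:
        if p["min"] >= threshold:      # fully relevant: answered from stats alone
            total += p["count"]
        elif p["max"] >= threshold:    # partially relevant: must scan this pack
            total += sum(1 for v in p["rows"] if v >= threshold)
        # else: irrelevant pack, skipped without being read or decompressed
    return total
```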
SQL Server

SQL Server provides a column-oriented index over base tables. This is a non-updateable, redundant data structure, not a columnar DBMS… but it acts like a Level 2 column store for query processing. That is, only the columns required for a query are projected/read.
Vectorwise

Vectorwise uses PAX and is a Level 1 column store in this regard. It provides a MINMAX index that performs the same function as a zone map or skip processing from IBM. It contains a fundamentally different execution engine than other PAX systems… that is, it is not row/tuple-based… but neither is it a columnar engine. Although it sounds odd, I believe that the Vectorwise architecture has the advantages of a Level 1 column store and some of the advantages of a Level 3 column store without the advantages of a Level 2 column store.
First, you should be able to see clearly from these posts why OLTP is poor for column stores… the data comes in as a row and has to be decomposed into bit-mapped, compressed columns. This is like incurring the overhead for an index update and compression at the same time. The same problem has to be overcome for data loading.
The paper referenced in Part 1 on C-Store outlined an approach to get around the issue for data loading… and Vertica is the commercial offspring of the C-Store project so it uses this approach. But the C-Store tuple-mover architecture still includes too much overhead for OLTP due to constraints tied to I/O… so HANA implemented the approach in-memory which results in a single table that supports both OLTP and analytics… supporting both write-optimized and read-optimized workloads.
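A toy illustration of the idea… a write-optimized delta store in front of a read-optimized columnar main store, with a merge step standing in for the tuple mover. All class and method names here are mine, for exposition:

```python
# A toy write-optimized delta plus read-optimized main store, loosely in the
# spirit of the C-Store/HANA approach described above; all names are mine.

class HybridColumnTable:
    def __init__(self, columns):
        self.columns = columns
        self.delta = []                          # row-oriented, cheap inserts (OLTP)
        self.main = {c: [] for c in columns}     # column-oriented, scan-friendly

    def insert(self, row):
        self.delta.append(row)                   # OLTP path: no re-encoding per write

    def merge(self):
        """The 'tuple mover': fold the delta into the columnar main store."""
        for row in self.delta:
            for c in self.columns:
                self.main[c].append(row[c])
        self.delta.clear()

    def scan(self, column):
        """Analytic reads see main plus the not-yet-merged delta."""
        return self.main[column] + [r[column] for r in self.delta]

t = HybridColumnTable(["id", "amount"])
t.insert({"id": 1, "amount": 10.0})   # fast OLTP-style write
t.merge()                             # background merge pass
print(t.scan("amount"))               # -> [10.0]
```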
JOINs
Level 1 and Level 2 columnar databases join using a row engine… that is, they convert columns to rows and then join. This misses some powerful opportunities for optimization from joining native column structures.
Hybrid systems like HANA and DB2 BLU support a mixture of row and column tables and allow joins across the types. Note that early-materializing Level 2 column stores like Teradata and Greenplum also support a mix… but they materialize the columns as rows early. DB2 BLU currently converts column-oriented tables to a row orientation and joins using the row engine. In this case BLU operates like a Level 2 maturity column store. I would expect that this will change over time.
The HANA optimizer has some limited ability to join mixed tables natively. Otherwise the optimizer has the ability to decide whether to convert row-oriented tables to column orientation or to convert column-oriented tables to rows. Note that with support for a single table for both OLTP and analytics HANA users typically deploy only columnar tables and avoid orientation translation during runtime.
Analytics
If you materialize rows late in a Level 3 mature column store then you get significant performance benefits. Using SIMD instructions provides an 8X performance advantage over using conventional ADD instructions. Super-computing vector processing provides at least that much again. Loading compressed columns instead of rows reduces CPU stalls to almost nothing, providing another 2X boost… and getting data from the cache instead of accessing DRAM provides a 5X-20X (average of 15X) improvement. These benefits are why BLU and HANA are so compelling when compared to early-materializing Level 2 implementations like Teradata and Greenplum.
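Python cannot issue SIMD instructions directly, but NumPy’s vectorized kernels give a feel for the effect: one operation applied to a contiguous column versus a row-at-a-time loop. The timings will vary by machine… it is the ratio that matters:

```python
# A stand-in for the vectorized-aggregation effect described above: summing a
# contiguous column with a vectorized kernel vs. a per-row interpreted loop.

import time
import numpy as np

column = np.random.randint(0, 1_000, size=10_000_000)

t0 = time.perf_counter()
total_loop = sum(int(v) for v in column)     # row-at-a-time style
t1 = time.perf_counter()
total_vec = int(column.sum())                # contiguous, vectorized kernel
t2 = time.perf_counter()

assert total_loop == total_vec
print(f"loop: {t1 - t0:.2f}s  vectorized: {t2 - t1:.3f}s")
```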
Other Points
Note that HANA does not gain a huge I/O advantage from columnar compression… this is because HANA only performs I/O on a large-scale at startup. The gains from compression for an IMDB come through efficient use of memory as compared to a Level 2, early materializing, columnar implementation.
IBM has criticized the HANA delta store/OLTP+analytics implementation (I can’t find the post for some reason… but I know that I read it)… but I do not see the logic behind the criticism. Row orientation works best for OLTP, and one way or the other you have to change the orientation if you want the benefits of columnar…
In my last post here I suggested that there were three levels of maturity around column orientation and described the first level, PAX, which provides columnar compression. This apparently is the level Exadata operates at with its Hybrid Columnar Compression.
In this post we will consider the next two levels of maturity: early materialized column processing and late materialized column processing which provide more I/O avoidance and some processing advantages.
In the previous post I suggested a five-column table and depicted each of those columns oriented on disk in separate file structures. This orientation provides the second level of maturity: columnar projection.
Imagine a query that selects only four of the five columns in the table, leaving out the EmpFirst column. In this case the physical structure that stores EmpFirst does not have to be accessed; 20% less data is read, reducing the I/O overhead by the same amount. Somewhere in the process the magic has to be invoked that returns the columns to a row orientation… but just maybe that overhead costs less than the savings from the reduced I/O?
Better still, imagine a fact table with 100 columns and a query that accesses only 10 of the columns. This is a very common use case. The result is that 90% less data has to be read, with a matching reduction in the cost of weaving columns into rows. This is columnar projection, and its impact far outweighs the small advantage offered by PAX (PAX may provide a 10%-50% compression advantage over fully columnar tables). This is the advantage that lets most of the columnar databases beat Exadata in a fair fight.
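To see the mechanics, here is a bare-bones sketch of Level 2 projection: one file per column, so a query opens only the files it needs. The layout is illustrative, not any vendor’s format:

```python
# A sketch of columnar projection: each column lives in its own file, so a
# query touching a subset of columns never reads the rest. Values round-trip
# as strings here; a real store would keep typed, compressed column files.

import os

def write_columnar(table_dir, rows, columns):
    os.makedirs(table_dir, exist_ok=True)
    for c in columns:
        with open(os.path.join(table_dir, f"{c}.col"), "w") as f:
            f.writelines(f"{row[c]}\n" for row in rows)

def project(table_dir, needed):
    """Read only the files for the needed columns, then weave rows back."""
    cols = {}
    for c in needed:                     # untouched column files are never opened
        with open(os.path.join(table_dir, f"{c}.col")) as f:
            cols[c] = [line.rstrip("\n") for line in f]
    return [dict(zip(needed, vals)) for vals in zip(*(cols[c] for c in needed))]

rows = [{"EmpID": i, "EmpFirst": "A", "EmpLast": "B", "Dept": "Eng", "Salary": 100}
        for i in range(3)]
write_columnar("emp", rows, ["EmpID", "EmpFirst", "EmpLast", "Dept", "Salary"])
print(project("emp", ["EmpID", "EmpLast", "Dept", "Salary"]))  # EmpFirst.col never read
```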
But Teradata and Greenplum stop here. After data is projected and selected the data is decompressed into rows and processed using their conventional row-based database engines. The gains from more maturity are significant.
The true column stores read compressed columnar data into memory and then operate on the columnar data directly. This provides distinct advantages:
Since data remains compressed, DRAM is used more efficiently
Aggregations against a single column access data in contiguous memory, improving cache utilization
Since data remains compressed, processor caches are used more efficiently
Since data is stored in bit maps, it can be processed as vectors using the super-computing instruction sets available in many CPUs
Aggregations can be executed using multiplication instead of table scans (see the sketch after this list)
Distinct query optimizations are available when columnar dictionaries are available
Column structures behave as built-in indexes, eliminating the need for separate index structures
These advantages can provide 10X-50X performance improvements over the previous level of maturity.
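For one concrete instance, here is how “aggregation using multiplication” can work against a dictionary-encoded column… the sum needs only the dictionary and the code frequencies, never the decompressed rows (a sketch with made-up names and data):

```python
# Aggregation via multiplication over a dictionary-encoded column: SUM is
# computed from distinct values times their frequencies, not a row scan.

from collections import Counter

dictionary = {0: 10.0, 1: 25.0, 2: 99.0}        # code -> value
codes = [0, 1, 1, 2, 0, 0, 2, 1]                # the stored, encoded column

freq = Counter(codes)                            # one pass over the small codes
total = sum(dictionary[c] * n for c, n in freq.items())
assert total == sum(dictionary[c] for c in codes)  # same answer as a full scan
print(total)                                       # -> 303.0
```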
Summary
Column Compression provides approximately a 4X performance advantage over row compression (10X instead of 2.5X). This is Column Maturity Level 1.
Columnar Projection includes the advantages of Column Compression and provides a further 5X-10X performance advantage (if your queries touch 1/5-1/10 of the columns). This is Column Maturity Level 2.
Columnar Processing provides a 10X+ performance improvement over just compression and projection. This is Column Maturity Level 3.
Of course your mileage will vary… If your workload tends to touch more than 80% of the columns in your big fact tables then columnar projection will not be useful… and Exadata may win. If your queries do not do much aggregation then columnar processing will be less useful… and a product at Level 2 may win. And of course, this blog has not addressed the complexities of joins and loading and workload management… so please do not consider this as a blanket promotion for Level 3 column stores… but now that you understand the architecture I hope you will be better able to call BS on the marketing…
Included is a table that outlines the maturity level of several products:

Product      Maturity
Exadata      Level 1 (PAX via Hybrid Columnar Compression)
Infobright   Level 1
Vectorwise   Level 1 (PAX, with a vectorized engine)
Teradata     Level 2 (early materialization)
Greenplum    Level 2 (early materialization)
SQL Server   Level 2 (columnstore index)
Sybase IQ    Level 3
Vertica      Level 3
DB2 BLU      Level 3
HANA         Level 3