An Elastic Shared-Nothing Architecture

In this post we will consider again the implications of implementing a shared-nothing architecture in the cloud. That is, we will start wondering about how a static shared-nothing cluster might be extended when deployed into an elastic hardware environment.


This is the first of three posts inspired by a series of conversations with the folks at Bityota (Bityota.com). After seeing the topics they asked if they could use the content in their marketing… so to be transparent… this is sort of a commercial post… but as you will see there is no promotional foam in the narrative.

– Rob


There is an architectural mismatch between Cloud Computing and a shared-nothing architecture.

In the Cloud, compute (processors and memory) scales independently of storage (disk and I/O bandwidth). This independence allows for elasticity: more compute can be dynamically added with full access to data on a shared disk subsystem. Figure 1 shows this relationship and depicts the elasticity that makes the Cloud so compelling.

Figure 1. Elastic Compute

In a shared-nothing architecture, compute and storage scale together as shown in Figure 2. This tight connection ensures that I/O bandwidth, the key to read performance, is abundant. But in the end scalability is more about scaling I/O than about scaling compute. This is due to the imbalance Moore’s Law injects into computer architecture… compute performance has far outstripped I/O performance over the years.

Figure 2. Shared-nothing Bundles

To solve for this imbalance database engineers have worked very hard to avoid I/O. They invented indexing and partitioning and compression and column-store all with the desire to avoid I/O. When they could not avoid I/O they worked hard to minimize the cost by pre-fetching data into memory and, once fetched, by keeping data in memory as long as possible.

For example, one powerful and little understood technique is called the data flow architecture. Simply put, data flow moves rows through each step of a query execution plan and out the other side without requiring intermediate I/O. The original developers of Postgres, Sybase, SQL Server, Teradata, DB2, and Oracle did not have enough memory available to flow rows through, so those engines spill data to the storage layer between each step in the plan. Figure 3 shows how classic databases spill and Figure 4 shows how a more modern data flow architecture operates.

Figure 3. Classic Query Plan
Figure 4. Data Flow Query Plan
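
Here is a minimal sketch in Python (not any vendor’s engine) of the contrast the figures depict: the classic style materializes the full output of each step before the next step starts, while the data flow style streams each row through the whole plan using generators, so no intermediate result is ever written.

```python
rows = [{"region": "EU", "amount": 10}, {"region": "US", "amount": 25},
        {"region": "EU", "amount": 7}]

# Classic style: each step produces its full output before the next step starts
# (in a real engine these intermediates would spill to the storage layer).
def classic(rows):
    scanned = [r for r in rows]                          # step 1: scan, materialized
    filtered = [r for r in scanned if r["amount"] > 5]   # step 2: filter, materialized again
    projected = [r["region"] for r in filtered]          # step 3: project, materialized again
    return projected

# Data flow style: each operator is a generator; a row flows through the whole
# plan before the next row is read, so nothing is stored between steps.
def scan(rows):
    for r in rows:
        yield r

def filter_op(source):
    for r in source:
        if r["amount"] > 5:
            yield r

def project(source):
    for r in source:
        yield r["region"]

print(classic(rows))
print(list(project(filter_op(scan(rows)))))
```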

Why is this relevant? In a classic RDBMS the amount of I/O bandwidth available per GB of data is static. You cannot add storage without redistributing the data. So even though your workload has peaks and valleys your database is bottlenecked by I/O and this cannot flex. In a modern RDBMS most of the work is performed in memory without intermediate I/O… and as we discussed, compute and memory can elastically flex in a Cloud.

Imagine an implementation as depicted in Figure 5. This architecture provides classic static shared-nothing I/O scalability to read data from disk. However, once the read is complete and a modern data flow takes over, the compute and memory are managed by a scalable elastic layer. The result is an elastic shared-nothing architecture that is well suited for the cloud.

Figure 5. Flowing to a Separate Compute Node

In fact you can imagine how this architecture might mature over time. In early releases a deployment might look like Figure 5 where the advantage of the cloud is in devising a cost-effective flexible configuration. As the architecture matures you could imagine a cloud deployment such as in Figure 6 where the 1:1 connection between storage nodes and compute nodes is broken and compute can scale dynamically with the workload.

Figure 6. Elastic Compute on a Shared-nothing Architecture

Cloud changes everything and it will significantly change database systems architecture.

It is strange to say… but the torch that fires innovation has been passed from the major database vendors to a series of small start-ups. Innovation seems to occur exclusively in these small firms… with the only recent exception being the work done at SAP on HANA.

How DBMS Vendors Admit to an Architectural Limitation: Part 1 – Oracle Exadata

Database vendors don’t usually admit to shortcomings… they protest that they have no shortcomings until the market suggests otherwise… then they make some sort of change that signals an admission. This series will explore three of these admissions: Oracle and the shared-nothing architecture, DB2 on the mainframe and the shared-nothing architecture, and Teradata and in-memory processing.

For years Oracle verbally thrashed Teradata in the market… proclaiming that the shared-nothing architecture was bunk. But in the data warehousing space Teradata acquired a large chunk of the market; and more importantly, they won more business as the size of the data warehouses grew. The reason for this is two-fold: the shared-nothing architecture lets you deliver more I/O bandwidth to the problem… and once you have read the disk it provides scalability to deliver more compute to process complex queries.

Finally Oracle had enough and they delivered Exadata, a storage engine attached to the conventional Oracle RAC that provided shared-nothing I/O bandwidth to the biggest part of the problem… the full file scan of big fact tables. This was an admission that they had been wrong all along.

Exadata was a tack-on… not a fundamental redevelopment of the Oracle database engine. They used the 80/20 rule to quickly get something to market and stem the trickle of Oracle customers who were out of gas on RAC and headed to shared nothing products: Teradata, Netezza, and Greenplum.

This was a very smart move and it worked. Even though the 80/20 approach meant that there was a significant set of queries Exadata could not help… the complex queries that needed to process large working sets to execute joins… Exadata solved enough of the problem to keep devout Oracle shops in the church. Only the shops who felt that complex query performance was important enough to warrant the cost of a migration (for an existing DW that had grown up) or the lesser cost of introducing a new technology (for a new DW) would move.

So, while Exadata was a smart move… it is a clear admission that shared-nothing is the right architecture for data warehouses and marts. This admission makes it clear that it is silly to build a warehouse or mart on normal Oracle or on RAC unless you consider your database an inviolable part of a technological creed.

In my opinion selecting a database is an engineering process that does not require orthodoxy… we should be strong enough engineers to pick the better technology and learn it. Being an “Oracle shop” is lazy.

Note that the in-memory technologies provided in Oracle 12c are significant… and for warehouses and marts that will fit on a single node, 12c, as it matures, will be a fine choice for the orthodox Oracle shop and for others. For bigger data applications you will require Exadata and the limitations that come with it.

This provides a nice transition to the Part 2 post on Teradata and in-memory.


A Modern Data Warehouse Architecture: Part 3 – Build an EDW Annex

In the first two posts of this series (here and here) I first suggested that Hadoop could be effectively used as the platform for staging and then suggested that a modern warehouse would have a federation layer that turned it into a logical data warehouse. Figure 3 depicts this extended architecture.

Figure 3. A Logical EDW

But if we have both Hadoop and a federation layer implemented… and we recognize the economics associated with moving data to Hadoop… Hadoop provides a 5X-50X price advantage over a commercial very large DBMS product… and we can move data from the expensive environment to the low-cost environment without impacting any applications… then we have the opportunity to move governed EDW data to Hadoop and place it into a Hadoop EDW Annex. Figure 4 shows this.

Figure 4. An EDW Annex

Now you might suggest that there is an impact… Hadoop will be significantly slower than a commercial EDW platform (for now…). But experienced EDW architects understand that in the classic architecture we had to co-locate data in a single database to join the data. So we put all of the data, hot and cold, in our EDW even though the service levels required for queries that touch old, cold, historical data did not justify the power and price of the EDW infrastructure. We had to co-locate it… we did not need EDW performance for all of it. We knew, if only implicitly, that most EDW queries touch a small subset of the data. Following the ratio suggested by Teradata (see the reference here) that 90% of the queries touch only 20% of the data, we can imagine a system where 80% of the data resides in Hadoop to service 10% of the queries… and only that 10% experiences Hadoop performance.
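
A back-of-the-envelope sketch of that arithmetic, using assumed $/TB figures and an assumed warehouse size (the numbers are illustrations, not quotes):

```python
# Assumed costs and sizes for illustration only.
edw_cost_per_tb = 25_000      # assumed commercial EDW platform cost per TB
hadoop_cost_per_tb = 1_000    # assumed Hadoop cost per TB
total_tb = 100                # assumed warehouse size

hot_fraction = 0.20           # the 20% of data touched by 90% of queries
hot_tb = total_tb * hot_fraction
cold_tb = total_tb - hot_tb

classic_cost = total_tb * edw_cost_per_tb
annex_cost = hot_tb * edw_cost_per_tb + cold_tb * hadoop_cost_per_tb

print(f"All data in the EDW: ${classic_cost:,.0f}")
print(f"EDW + Hadoop Annex:  ${annex_cost:,.0f}")
```

With these (assumed) numbers the annex configuration costs roughly a quarter of the all-EDW configuration… and only the 10% of queries that touch cold data see Hadoop performance.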

I suggested this approach for Teradata here… but an architecture with an EDW Annex to store cleansed governed historical data works for any expensive RDBMS that can federate with Hadoop: Exadata, Netezza, Teradata, or HANA.

This concludes this series… sort of. I’ll post soon to say more about how this architecture provides long-term strategic value. I think that these three concepts… Hadoop as an EDW staging area, federation and logical data warehousing, and Hadoop as an EDW Annex… provide the foundation for a modern EDW… and I imagine that over the next several years this will become the reference architecture most of us will build to.

A Modern Data Warehouse Architecture: Part 2 – Make it a Logical Data Warehouse

In the first post of this series (here) I suggested that Hadoop could be effectively used as the platform for staging. Figure 2 describes the result. In this post I will extend the architecture by adding a data federation feature and turning the entire picture into a logical data warehouse… and then we’ll consider the consequences.

Figure 2. Add a Data Lake

Figure 3 shows this extension… but let’s quickly review what a federation fabric provides (for more detail on this please see the series that starts here).

Figure 3. Add a Federation Layer

First, the fabric allows any of the tables and files in the picture to be registered as “virtual” tables. This includes tables in the EDWs and any marts, in one or more Hadoop systems, in the sandbox, and even in the source systems. From there the fabric is viewed by any program in the BI/Analytics/Apps layer as a single relational database composed of all of the registered tables. This fabric would consume ODBC and JDBC queries at a minimum and provide some level of data translation, function translation, and query translation so that all of the virtual tables can be queried through the single SQL dialect offered by the fabric. Finally, the fabric would provide some measure of optimization to reduce the overhead of accessing these distributed systems. It is this optimization that is the main topic of the series I referenced above.
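
To make the idea concrete, here is a toy sketch in Python (not any product’s API) of what registration and routing look like: virtual tables from different systems are registered into one catalog, and the fabric translates and pushes each query to the system that owns the physical data.

```python
class FederationFabric:
    """Toy illustration of a federation fabric catalog and router."""

    def __init__(self):
        self.catalog = {}          # virtual table name -> (system, physical name)

    def register(self, virtual_name, system, physical_name):
        self.catalog[virtual_name] = (system, physical_name)

    def query(self, virtual_name, predicate_sql):
        system, physical = self.catalog[virtual_name]
        # A real fabric would perform data, function, and query translation
        # into the dialect of the underlying system at this step.
        pushed_down = f"SELECT * FROM {physical} WHERE {predicate_sql}"
        print(f"[{system}] {pushed_down}")

fabric = FederationFabric()
fabric.register("sales", "EDW", "dw.fact_sales")
fabric.register("weblogs", "Hadoop", "hdfs:///raw/weblogs")

# The BI tool sees one relational catalog and never knows where the data lives.
fabric.query("sales", "order_date >= '2014-01-01'")
fabric.query("weblogs", "status = 404")
```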

One of the important implications of this… one that is often overlooked before implementation… is that the queries emitted from the fabric add workload to the underlying databases. If, in Figure 3, the top EDW is 100% busy servicing un-federated queries then adding a workload that joins that data to the 2nd EDW’s data will overburden the system. This is why it is not usually sound design to integrate source OLTP systems into a fabric. The OLTP systems are not likely to be optimized in any way for the resulting workload.

But the upside of the fabric is significant. Consider:

  • If we just say “yes” and load ungoverned data into a sandbox, business users can immediately access that data and join it to dimensions and facts deployed elsewhere in the enterprise.
  • Rogue data marts can now be integrated back into the fold.
  • Redundant data deployed to allow joins within a single database instance can be eliminated and the joins can be federated. Note that federation is no silver bullet… there may be performance reasons for co-locating data… but you now can consider the trade-offs (I’ll post later on a way to federate and improve performance).
  • Database products can be retired without affecting the programs that access them. If you have an old data mart built on a product that you would rather not support or license… you can move the data and re-point the virtual tables without impacting the tools and applications at all.
  • Data can be relocated based on economics and/or for performance reasons… this will be the subject of the next blog in this series… but as a teaser, remember the economics… Hadoop costs $1K/TB (ok… $1K-$4K) hardware included and commercial databases cost much more.

There is one final advantage to this that is strategic and important enough to deserve to not be buried in the previous bullet list… All of us have seen the database product market move in the last 6+ years from a place where the choices were DB2, Oracle, SQL Server, or Teradata… to a market with those products plus Exadata, plus two flavors of SQL Server, plus Netezza, plus Greenplum, plus Vertica… and then plus Impala, plus Hive, plus Tez, plus Spark, plus MongoDB, plus Cassandra, plus plus plus. I think that it is impossible to place database bets today with the confidence that these bets will pay off five years from now… and certainly there are no bets good for ten years. If you are betting on infrastructure to support the Internet of Things the bets are more risky still. In my opinion a federation layer provides critical insulation from this chaos. With federation you can build applications knowing that you can acquire and retire database products and not affect the queries. IMO this insulation is a strategic imperative…

Sorry… I got a little excited… I don’t usually spew foam. Let me try again. IMO you should seriously consider the benefits of federation in both your EDW architecture and in your enterprise data architecture.

Part 3 of this series considers a new extension to the architecture by adding an EDW Annex… you can see it here.


A Modern Data Warehouse Architecture: Part 1 – Add a Data Lake

In this series I will present a perspective on how the classic EDW architecture is changing under the influence of new technologies, new requirements, and new economics.

Part 1 will show the first of three architectural changes by introducing a data lake into the picture… in Part 2 we’ll extend the picture by adding a logical EDW layer… and in Part 3 we’ll consider the implications of Hadoop as an EDW Annex.

Figure 1. A Classic EDW

Figure 1 describes a classic EDW architecture with sources feeding a staging area, transformations feeding cleaned and certified data to the EDW, and data consumed by analytic applications. In this picture we assume that data marts are part of the EDW fabric and that it is the responsibility of the applications to know which database to query. In most enterprises there are rogue data marts with ungoverned data and these are an issue… but more on that later.

Note that Figure 1 also represents an optional sandbox area where uncertified and ungoverned data can be landed. This helps avoid paying the cost of data governance upfront… before data has proven it has value. A sandbox provides support for rogue data marts and cubes. Obviously the use of uncertified data introduces issues… but many believe, as I do, that the trade-off of agility over governance can be productive under the right circumstances.

Figure 2. An EDW with a Data Lake

Figure 2 represents a step towards a modern EDW architecture with a Hadoop-based data lake replacing the staging area as well as providing support for a sandbox.

A data lake provides all of the same capabilities as the staging area it replaced… it may be a landing zone for untransformed data… but it has several other important characteristics:

  • A data lake can hold raw data forever rather than store it temporarily.
  • A data lake has compute included so it can execute transformations and become a single platform for staging and ETL.
  • A data lake has compute and tools included so it can be used to analyze raw data for trends and anomalies.
  • A data lake can easily store semi-structured and unstructured data.
  • A data lake can store big data.

Of course a staging area based on an RDBMS could do the same… but the economics are completely different. A Hadoop system provides storage and processing for as little as $1K/TB; the cost of EDW hardware and software ranges from $15K/TB to $50K/TB… 15X to 50X more.

In addition, the data lake provides a cost-effective, extensible platform for building more sandboxes. If, by using Hadoop, you lower the upfront cost of a sandbox to $1K/TB then, when combined with the cost reduction from postponing the development of data quality and governance processes, you may find yourself in a position where you can just say “yes” to requests to add new data. At $1K/TB it may be unnecessary to force the business to build a business case and perform the difficult intellectual challenge of developing an ROI argument for data that has never been available and therefore has an unproven value proposition. Later, once the value is proven, you might move it to a governed state and to a governed platform.

Hadoop as an EDW staging area is not a new concept (see here)… the ETL vendors are supporting Hadoop as an ETL engine already. But a Hadoop staging approach starts to solve one of the nagging problems with the classic architecture: the lack of agility. Just saying “yes” to new data and corralling rogue marts provides a foundation to experiment and evolve while also providing the means to leverage successful experiments across the Enterprise.

There is more required to satisfy sandbox users… on to that in Part 2 – Logical Data Warehousing here.

References:

  • Wiktionary definition of a Data Lake here.
  • Informatica Big Data Edition here.
  • Curt Monash on IBM ETL here.

The Greenplum ORCA Optimizer

In January Greenplum rolled out a new query optimizer. This is very cool and very advanced stuff.

Query optimization is a search problem… in a perfect world you would search through the space of all possible plans for any query and choose the least expensive plan. But iterating through all possible plans would take longer than most queries take to run… so optimizers use rules to cut down the space searched. The rules have been built up over the years and are designed to prune the space quickly to keep performance high for simple queries. But these rules can break down when complex queries are introduced… so Greenplum made the significant investment to build a new optimizer from scratch.

Florian Waas, the leader of this program for Greenplum (now off on another venture) explained it to me this way. If the large rectangle in Figure 1 represents the total search space for a query, a modern query optimizer only searches the area in the small gray square… it looks for the best plan in that small space.

You may be surprised to learn that the optimizers used by every major DBMS product are single-threaded… they use only one core of a multi-core processor to search the space and produce a plan. A faster single processor does not help much (even if it let you search more, the time spent optimizing as a percentage of the query execution time would stay the same… because query execution would speed up as well)… so if the optimizer is to search more of the space it will have to use multiple cores and search the space in parallel… and this is exactly what Greenplum has accomplished.
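
As a rough illustration of the idea only (the plan enumeration and cost model below are toys, not how ORCA is implemented), candidate join orders can be costed across all available cores instead of one:

```python
from multiprocessing import Pool
from itertools import permutations

tables = {"A": 1_000_000, "B": 50_000, "C": 500}   # assumed row counts

def cost(join_order):
    # Toy cost model: cost of a left-deep join is the running product of sizes,
    # scaled by a crude selectivity factor.
    total, running = 0, tables[join_order[0]]
    for t in join_order[1:]:
        running = running * tables[t] // 1_000
        total += running
    return total, join_order

if __name__ == "__main__":
    candidates = list(permutations(tables))          # the "search space"
    with Pool() as pool:                             # search it in parallel
        best_cost, best_plan = min(pool.map(cost, candidates))
    print("cheapest join order:", best_plan, "estimated cost:", best_cost)
```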

The benchmark results for this are impressive (see here)… several queries in the TPC-DS suite run hundreds of times faster.

ORCA is available to early support customers now and the results map to the benchmark… some queries see an extreme performance boost, while others run significantly slower. This is to be expected from any first release optimizer.

But Greenplum has built another advanced technology into ORCA to reduce the time it will take to mature the software. ORCA includes AMPERe, an optimizer debugging facility that captures the state necessary to recreate problems and fix them. Together these capabilities… parallel search and specialized debugging… have advanced the state of the art significantly.

What does it mean to you? It will take some time to shake out ORCA… and HAWQ is still very slow when compared to other analytic databases… and very very slow when compared to the in-memory databases available… and in-memory products like Spark are coming to the Hadoop eco-system. But at the price point HAWQ is a bargain. If you need an inexpensive batch engine that crunches numbers offline then in the next year, as ORCA matures, it may be worth a look.

As a side note… this topic introduces one of the issues related to in-memory databases… when even a very complex query completes with a sub-optimal plan in under a second how much time can you spend searching the plan space? I suspect that applying the parallel optimization principles developed by the Greenplum team will yield similar or even better improvements for in-memory… and these techniques will be a requirement very soon in that space.


Logical Data Warehouses and the Basics of Database Federation

This post will consider the implications of a full database federation as would be required by a Logical Data Warehouse. I’ll build on the concepts introduced in the posts on RDBMS-Hadoop integration (Part 1, Part 2, Part 3, Part 4, Part 5, Part 6, Part 7, Part 8).

Figure 1 summarizes those earlier concepts from simple to advanced.

Figure 1. 2 Tier Federation Maturity

But the full federation required to implement a logical data warehouse requires a significant step up from this. Simple federation will be a disaster and Basic federation will not be much better. Here is why.

Let’s add a database and use Figure 2 to consider the possibilities when we submit a query that joins Table A.One to A.Two to B.One to C.One. Note that in this picture we have included a Governor, independent of any of the DBMSs, to execute the federated queries… this is the usual case for federation.

In the simple case where the Governor executes the entire plan all of the data must come to the Governor. This is clearly unacceptable. Consider the worst case where a SELECT is issued against only one table… still all of the data must bubble up.

In the Basic case the problem is partially mitigated… less data moves after the predicates are resolved but the overhead will still kill query performance. A Governor with basic capabilities provides the minimal features to make this work. It is useful where slow federation is better than data replication… but that is about all.

Figure 2. N-Tier Federation

However, the advanced case becomes seriously more complicated. The optimizer now has to decide whether table B.One should move to C to join the data, whether it should move to A… or whether the data should move to the Governor.

The problem is further complicated by any resource shortage on any node or by any differences in functional capability. If the cost of data movement would suggest moving B data to C… but there is no CPU resource available on C… then maybe a different decision should be made? If C.One is a big table but C is a column-store, and the cost of the SELECT is small because only a few low-cardinality columns are required and the data might be fetched from the dictionary, then we might make a different decision. If B is a fast in-memory database but there is no memory available then the cost changes. Finally, if there are twenty databases in your logical DW then the problem increases exponentially.
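
A toy cost model (pure assumption, not any optimizer’s actual arithmetic) makes the trade-off visible. It prices only data movement; a real optimizer would also have to weigh CPU headroom, column-store projection, memory pressure, and so on, as described above.

```python
# Assumed table sizes and network speed for illustration only.
def movement_cost(rows, bytes_per_row, network_gb_per_sec=1.0):
    """Seconds of network time to move a table."""
    return (rows * bytes_per_row) / (network_gb_per_sec * 1e9)

b_one = {"rows": 10_000_000, "bytes_per_row": 200}
c_one = {"rows": 500_000_000, "bytes_per_row": 50}

options = {
    "move B.One to C":       movement_cost(**b_one),
    "move C.One to B":       movement_cost(**c_one),
    "move both to Governor": movement_cost(**b_one) + movement_cost(**c_one),
}

for plan, seconds in sorted(options.items(), key=lambda kv: kv[1]):
    print(f"{plan:25s} ~{seconds:6.1f}s of data movement")
```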

The point here is clear… data federation over n-tiers is a hard problem. There will be severe performance issues when the optimizer picks wrong. This is why the independent governor model is so attractive… Many of the variables around CPU resources and database capabilities are removed… and while the performance will be poor it will be predictably poor. You should consider the implications carefully… it is just not clear that a high-performance logical data warehouse is feasible simply laid over an existing architecture. And if you build on a model with a Governor you must be sure that the Governor and the Provincial databases can handle the load. I suspect that the Governor will have to run on a cluster and use a shared-nothing architecture to handle a true enterprise-sized logical EDW.

HANA has a twist on this that is interesting. The Governor lives inside one of the database nodes… so for data in HANA there is no data movement cost unless the optimizer decides to send the data to another node. Further, HANA is very fast… and the performance will mitigate some of the slowness inherent in federation. Finally, HANA is a shared-nothing DBMS… so it is not a problem to move lots of data to HANA in support of big tables and/or thousands of concurrent queries.

I’ll try to use this thinking: simple, basic, advanced federation over some governed federator on an in-memory or fast shared-nothing architecture to evaluate the products on the market that provide federation. This may prove interesting as the Logical Data Warehouse concept catches on and as products like Teradata’s QueryGrid come to market.

Part 8 – How Hadooped is SQL Server PDW with Polybase?

Now for SQL Server… continuing the thread on RDBMS-Hadoop integration (Part 1, Part 2, Part 3, Part 4, Part 5, Part 6, Part 7) I have suggested that we could evaluate integration architecture using three criteria:

  1. How parallel are the pipes to move data between the RDBMS and the parallel file system;
  2. Is there intelligence to push down predicates; and
  3. Is there more intelligence to push down joins and other relational operators?

Before we start I will suggest a fourth criterion that will be more fully explored later when we consider networks and pipes… that is: how is data sharded/hashed/distributed as it moves from the distribution scheme in HDFS to an optimal, usually hashed, scheme in the target RDBMS? Consider Greenplum as an example… they move data in parallel as quickly as possible to the GPDB and then redistribute the data across GPDB segment nodes using scatter-gather, a very efficient distribution mechanism. We will consider how PDW Polybase manages this as part of our first criterion.

Also note… since I started this series Teradata has come out with a new capability: the QueryGrid. I will add a post to consider this separately… and in this note I will assume the older Teradata capability. This is a little unfair to Teradata and I apologize for that… but otherwise this post becomes too complex. I’ll make things right for Teradata ASAP.

Now on to Microsoft…

First, Polybase has effective parallel pipes to move data from HDFS to the parallel SQL Server instances in PDW. This matches the best capability of other products like Teradata and Greenplum in this category. But where Teradata and Greenplum move data and then redistribute it, pushing the data over a network twice, Polybase has pushed the PDW hash function down to the HDFS node so that data is distributed as it is sent. This very nice feature skips one full move of the data.
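
A sketch of why that matters (the hash scheme below is illustrative, not PDW’s actual function): if the reader on the HDFS side applies the target’s distribution hash, each row crosses the network once; loading first and redistributing afterwards moves every row twice.

```python
import zlib

NUM_TARGET_NODES = 4

def target_node(distribution_key: str) -> int:
    # Stand-in for the target RDBMS's distribution hash.
    return zlib.crc32(distribution_key.encode()) % NUM_TARGET_NODES

rows = [("order-1001", "EU"), ("order-1002", "US"), ("order-1003", "EU")]

# Hash-at-source: the HDFS-side reader decides the destination node directly,
# so each row is shipped straight to the segment that will own it.
shipments = {n: [] for n in range(NUM_TARGET_NODES)}
for key, payload in rows:
    shipments[target_node(key)].append((key, payload))

for node, batch in shipments.items():
    print(f"ship {len(batch)} row(s) directly to target node {node}")
# The alternative (load anywhere, then redistribute) would move each row a
# second time inside the target cluster.
```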

Our second criterion considers how smart the connector is in pushing down filters/predicates. Polybase uses a cost-based approach to determine whether it is less expensive to push predicates down or to move all of the data up to the PDW layer. This is a best-in-class capability.

For the third criterion we ask: does the architecture push down advanced functions like joins and aggregates… and does it minimize the data pulled up by using semi-joins? Polybase again provides strong capabilities here, pushing down joins and aggregates. Polybase does not use semi-joins, so there is room to improve here… but Microsoft clearly has this capability in their roadmap.
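
For what a semi-join buys, here is a minimal sketch (invented tables, not Polybase code): the RDBMS ships only its join keys down, and only the matching remote rows come back up.

```python
# Small table in the RDBMS and a big remote table on the Hadoop side (invented).
local_orders = [{"cust_id": 1}, {"cust_id": 3}]
remote_customers = [{"cust_id": i, "name": f"c{i}"} for i in range(1_000)]

# Step 1: push the distinct join keys down to the remote side.
keys = {r["cust_id"] for r in local_orders}

# Step 2: the remote side returns only rows whose key matches (the semi-join),
# so the full remote table never travels up to the RDBMS.
pulled = [r for r in remote_customers if r["cust_id"] in keys]

print(f"pulled {len(pulled)} of {len(remote_customers)} remote rows")
```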

One final note… Polybase works with PDW but not with other SQL Server products. This limitation may be relevant in many cases.

PDW + Polybase is a strong offering… matching HANA in most aspects. HANA has a slight edge in push-down thanks to semi-joins, but SQL Server counters with the most sophisticated parallel data distribution capability.


Part 7 – How Hadooped is Greenplum, the Pivotal GPDB?

Now for Greenplum & Hadoop… to continue this thread on RDBMS-Hadoop integration (Part 1, Part 2, Part 3, Part 4, Part 5, Part 6) I have suggested that we could evaluate integration architecture using three criteria:

  1. How parallel are the pipes to move data between the RDBMS and the parallel file system;
  2. Is there intelligence to push down predicates; and
  3. Is there more intelligence to push down joins and other relational operators?

The Greenplum interface is architecturally similar to the Teradata interface described in Part 4. Hadoop files are defined to the DBMS as external tables and there are capable parallel pipes to effectively move data from the HDFS side to GPDB. In addition Greenplum uses their Scatter-Gather method to load data into the GPDB effectively.

There is no ability to push down predicates. When a query executes all of the relevant data is sucked through the parallel pipes into the database segments for processing. This is very inefficient and there is not even the crude capability to push down processing provided by Teradata.

Finally, there is no ability to push down joins or aggregation.

Greenplum’s offering is not very advanced. To perform analytics with Greenplum, data must move between the two storage layers with no intelligence to mitigate the cost.

On to the last post in the series Part 8 on SQL Server and Polybase.

Dynamic Late Binding Schemas on Need

I very much like Curt Monash’s posts on dynamic schemas and schema-on-need… here and here are two examples. They make me think… But I am missing something… I mean that sincerely, not just as a setup for a critical review. Let’s consider how dynamism is implemented here and there… so that I can ask a question of the audience.

First imagine a simple unschema’d row:

Rob KloppDatabase Fog Bloghttp://robklopp.wordpress.com42

A human with some context could see that there is a name string, a title string, a URL string, and an integer string. If you have the right context you would recognize that the integer holds the answer to the question: “What is the meaning of Life, the Universe, and Everything?”… see here… otherwise you are lost as to the meaning.

If you load this row into a relational database you could leave the schema out and load a string of 57 characters… or load the data parsed into a schema with Name, Title, URL, Answer. If you load this row into a key-value store you can load it as an unschema’d row with the Key = Row and the Value equal to the string… or parse the data into four key-value pairs.

In any case you have to parse the data… If you store the data in an unschema’d format you have to parse the data and bind values to keys or columns late… if you store the data parsed then this step is unnecessary. To bind the data late in SQL you might create a view from your program… or more likely you would pull out the values with SQL string functions. To parse the data into key-value pairs you must do the equivalent. The same logic holds true for more complex parsing. A graph database can store keys, values, and relationships… but these facets have to be known and teased out of the data either early or late. An RDBMS can do the same.
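
A small sketch of the point, using the example row above (the split positions are assumptions a human, or a view, would have to supply… nothing in the raw string names the fields):

```python
raw = "Rob KloppDatabase Fog Bloghttp://robklopp.wordpress.com42"

# Early binding: parse once at load time and store named columns.
early = {"name": raw[:9], "title": raw[9:26], "url": raw[26:55], "answer": int(raw[55:])}

# Late binding: store the raw string and apply the same parsing logic only when
# a query asks for a field... the equivalent of a view or SQL string functions.
FIELD_SLICES = {"name": (0, 9), "title": (9, 26), "url": (26, 55), "answer": (55, None)}

def bind_late(row, field):
    start, end = FIELD_SLICES[field]
    return row[start:end]

print(early["answer"])           # 42, parsed at load time
print(bind_late(raw, "answer"))  # "42", parsed at query time
```

Either way the parsing logic has to exist somewhere… the only question is when it runs.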

So what is the benefit of a database product that proclaims late binding as an advantage? Is it that late binding is easier to do than in an RDBMS? What am I missing?

Please do not respond with a list of other features provided by NewSQL and NoSQL databases… I understand many of the trade-offs… what I want to know is:

  • What can they do connected to binding values to names that an RDBMS cannot? And if there is no new functionality…
  • Is there some way they allow for binding that is significantly easier?

By the way, the Hitchhiker’s Guide is silent on the question of whether 42 is a constant or ever-changing. I think that I’ll ask Watson.
