A Modern Data Warehouse Architecture: Part 1 – Add a Data Lake

In this series I will present a perspective on how the classic EDW architecture is changing under the influence of new technologies, new requirements, and new economics.

Part 1 will show the first of three architectural changes by introducing a data lake into the picture… in Part 2 we’ll extend the picture by adding a logical EDW layer… and in Part 3 we’ll consider the implications of Hadoop as an EDW Annex.

Figure 1. A Classic EDW

Figure 1 describes a classic EDW architecture with sources feeding a staging area, transformations feeding cleaned and certified data to the EDW, and data consumed by analytic applications. In this picture we assume that data marts are part of the EDW fabric and that it is the responsibility of the applications to know which database to query. In most enterprises there are rogue data marts with ungoverned data and these are an issue… but more on that later.

Note that Figure 1 also represents an optional sandbox area where uncertified and un-governed data can be landed. This helps avoid paying the cost of data governance upfront… before data has proven it has value. A sandbox provides support for rogue data marts and cubes. Obviously the use of uncertified data introduces issues… but many believe, as I do, that the trade-off of agility over governance can be productive under the right circumstances.

Figure 2. An EDW with a Data Lake

Figure 2 represents a step towards a modern EDW architecture with a Hadoop-based data lake replacing the staging area as well as providing support for a sandbox.

A data lake provides all of the same capabilities as the staging area it replaced… it may be a landing zone for untransformed data… but it has several other important characteristics:

  • A data lake can hold raw data forever rather than store it temporarily.
  • A data lake has compute included so it can execute transformations and become a single platform for staging and ETL.
  • A data lake has compute and tools included so it can be used to analyze raw data for trends and anomalies.
  • A data lake can easily store semi-structured and unstructured data.
  • A data lake can store big data.

Of course a staging area based on an RDBMS could do the same… but the economics are completely different. A Hadoop system provides storage and processing for as little as $1K/TB; the cost of EDW hardware and software ranges from $15K/TB to $50K/TB… 15X to 50X more.
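To put the comparison in concrete terms, here is a back-of-the-envelope sketch using the per-TB figures quoted above; the 100TB staging footprint is an assumption for illustration only.

```python
# Rough staging-area economics using the per-TB figures quoted in the text.
STAGING_TB = 100                                   # hypothetical footprint
HADOOP_COST_PER_TB = 1_000                         # ~$1K/TB
EDW_COST_PER_TB_LOW, EDW_COST_PER_TB_HIGH = 15_000, 50_000

hadoop_cost = STAGING_TB * HADOOP_COST_PER_TB
edw_low = STAGING_TB * EDW_COST_PER_TB_LOW
edw_high = STAGING_TB * EDW_COST_PER_TB_HIGH

print(f"Hadoop staging: ${hadoop_cost:,}")                 # $100,000
print(f"EDW staging:    ${edw_low:,} to ${edw_high:,}")    # $1.5M to $5M
print(f"Ratio:          {edw_low // hadoop_cost}X to {edw_high // hadoop_cost}X")
```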

In addition, the data lake provides a cost-effective, extensible platform for building more sandboxes. If, by using Hadoop, you lower the upfront cost of a sandbox to $1K/TB then, when combined with the cost reduction from postponing the development of data quality and governance processes, you may find yourself in a position where you can just say “yes” to requests to add new data. At $1K/TB it may be unnecessary to force the business to build a business case and take on the difficult intellectual challenge of developing an ROI argument for data that has never been available and therefore has an unproven value proposition. Later, once the value is proven, you might move it to a governed state and to a governed platform.

Hadoop as an EDW staging area is not a new concept (see here)… the ETL vendors are supporting Hadoop as an ETL engine already. But a Hadoop staging approach starts to solve one of the nagging problems with the classic architecture: the lack of agility. Just saying “yes” to new data and corralling rogue marts provides a foundation to experiment and evolve while also providing the means to leverage successful experiments across the Enterprise.

There is more required to satisfy sandbox users… on to that in Part 2 – Logical Data Warehousing here.

References:

  • Wiktionary definition of a Data Lake here.
  • Informatica Big Data Edition here.
  • Curt Monash on IBM ETL here.

Using Teradata’s Appliance for Hadoop to Reduce TCO

Teradata has recently announced a very complete Teradata database-to-Hadoop integration. In this note we’ll consider how a Teradata shop might effectively use these features to significantly reduce the TCO of any Teradata system.

The Teradata Appliance for Hadoop (here) offering is quite well thought out and complete… including a Teradata appliance, a Hadoop appliance, and the new QueryGrid capability to seamlessly connect the two… so hardware, software, support, and services are all available in very easy-to-consume bundles.
There is little published on the details of the QueryGrid feature… so I cannot evaluate where it stands on the query integration maturity curve (see here)… but it certainly provides a significant advance over the current offering (see here and Dan Graham’s associated comments).
I believe that there is some instant financial gratification to be had by existing Teradata customers from this Hadoop mashup. Let’s consider this…
Before the possibility of a Hadoop annex to Teradata, Teradata customers had no choice but to store cold, old data in the Teradata database. If, on occasion, you wanted to perform year-by-year comparisons over ten years of data then you needed to keep ten years of data in the database at a rough cost of $50K/TB (see here)… even if these queries were rarely executed and were not expected to run against a high performance service level requirement. If you wanted to perform some sophisticated predictive analysis against this data it had to be online. In fact, the Teradata mantra… one which I wholeheartedly agree with… suggests that you really should keep the details online forever as the business will almost always find a way to glean value from this history.
This mantra is the basis of what the Hadoop vendors call a data lake. A data warehouse expert would quickly recognize a data lake as a staging area for un-scrubbed detailed data… with the added benefit that a Hadoop-based data lake can store and process data at a $1K/TB price point… and this makes it cost-effective to persist the staged data online forever.
So what does this mean to a Teradata EDW owner? Teradata has published numbers (here) suggesting that 92% of the queries in an EDW only touch 20% of the data. I would suggest that there is some sort of similar ratio that holds for 90% of the remaining queries… they may touch only another 40% of the data. This suggests that the 40% of the data remaining is online to service less than 1% of the queries… and I suggest that these queries can be effectively serviced from the $1K/TB Hadoop annex.
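The arithmetic behind these ratios is worth making explicit; the percentages below are the assumed figures from the paragraph above, not measurements.

```python
# Working the query/data ratios suggested above.
hot_queries, hot_data = 0.92, 0.20            # 92% of queries touch 20% of data
warm_queries = 0.90 * (1 - hot_queries)       # 90% of the remaining queries...
warm_data = 0.40                              # ...touch another 40% of the data

cold_queries = 1 - hot_queries - warm_queries # what is left: ~0.8% of queries
cold_data = 1 - hot_data - warm_data          # serving the last 40% of the data

print(f"{cold_data:.0%} of the data services only {cold_queries:.1%} of the queries")
```

If that last slice of data can be serviced from a $1K/TB Hadoop annex instead of $50K/TB Teradata storage, the savings fall straight out of the same arithmetic.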
In other words, almost every Teradata shop can immediately benefit from Teradata’s new product announcements by moving 40% of their Teradata database data to Hadoop. Such a move would free Teradata disk space and likely take pressure off the need to upgrade the cluster. Further, when an upgrade is required, users can reduce the disk footprint of the Teradata database side of the system, add a Hadoop annex, and significantly reduce the TCO of the overall configuration.
Some time back I suggested that Teradata would be squeezed by Hadoop (here and here). To their credit Teradata is going to try and mitigate the squeeze. But the economics remain… and Teradata customers should seriously consider how to leverage the low $/TB of Teradata’s Hadoop offering to reduce costs. Data needs to reside in the lowest cost infrastructure that still provides the required level of service… and the Teradata Hadoop integration provides an opportunity to leverage a new, low-cost, infrastructure.

A Sidebar on the IoT: Using new Things smartly

This is a sidebar to some thinking on an architecture for the Internet of Things here… – Rob

I was recently prompted to think about a Big Data problem that is in the US papers… the issues around processing US Veterans through the Veteran’s Administration (VA) bureaucracy. I imagine that there are really two problems… I will outline them… but the point will be to try to get you to think about how to use the IoT creatively to mitigate seemingly intractable problems.

One requires some obvious, if not easy-to-implement, information technology… there has to be some fool-proof way to ensure that VA staff cannot game the system and hide problems. This could be solved with an audit system that looks for anomalies in processing patterns in much the same way that other fraud detection software operates.

But fraud is not the problem… it was meant to mask the problem. The fact is that there are more Veterans requiring medical assistance than there are funds and doctors to provide that assistance. All the fraud detection would have accomplished is to show how many Veterans are going without care.

So… the IT folks at the VA are faced with the impossible task of servicing too many Veterans with too few doctors… But military doctors are used to this issue and the word “triage” comes from the vocabulary of war… not the vocabulary of medicine.

This leads me towards the point… there are new things, in the IoT sense, coming out that can make triage of Veterans who are not hospitalized possible. Rumors are that the new iWatch contains several sensors that can monitor the pulse, temperature, and maybe more, of wearers. What the VA needs to do is immediately put an iWatch on every Veteran who applies for medical help. They then need to monitor the vitals of all of the veterans and schedule in those with the weakest signs. They need to notify individuals to come in immediately when vitals turn bad… Further, they need to track these vitals against medical records and outcomes to make the triage ever more efficient over time.

By using internet Things to triage, the VA could use their limited medical resources much more efficiently… and mitigate some or all of the problem.

The IoT affords us all opportunities to change the game for our employers. There will be opportunities to disrupt some existing markets. But to take full advantage will require us to be creative and smart. I, for one, am anxious to start… Now back to considering an architecture for things.

 

An Architecture for the IoT – Part 1

There are so many things in the Internet of Things (IoT) that might record data into your data fabric that a new approach may be required. Let’s think about this… define some terms, see how those terms fit into current data fabric thinking, consider how they fit into a more modern logical data warehouse architecture, and ask whether the IoT might push us to a different approach.

I’m not going to go overboard on terms here… But we do need to distinguish between a sensor and a processor.

To my way of thinking a sensor is a thing. It creates, but does not necessarily process, data. A sensor has some means to communicate with a processor… but if there is no significant processing on the sensor other than communications then we will suggest that there is no “processor” in a meaningful sense. Let me give you four examples:

  • The first is courtesy of Ray Carnes, a chief architect at Boeing. Imagine a brake-pad in your car with 100,000 dust-sized RFID sensors randomly scattered as part of the pad. These sensors do nothing but signal on an interval that they are present. This allows a processor elsewhere to record the signals and determine how much of the brake pad has worn. If only 80,000 sensors report we can assume that 20% of the pad has worn away.
  • A Nest thermostat senses movement and temperature. It uses a network connection to send the results of this sensing to the Nest mother-ship and performs little-or-no processing on site.
  • Sensors in my Audi detect rotation of the wheels. There is a network that sends the results to a small embedded anti-lock braking processor that monitors all four wheels as well as the pressure on the brake-pedal and sends signals to all five components to allow the car to brake evenly.
  • There is a sensor in the screen on the ATM I used yesterday that detects that I want to request service. This user interface communicates with a powerful general processor which then communicates with the Bank mother-ship to create and process banking transactions.

This last bullet is important… any device that takes user input is a sensor with an embedded processor. It is a “thing” just like the Nest thing. Today we tend to blur the line between sensor and processor as every thing has a powerful processor onboard. The IoT will change this assumption.

A processor then, is a computer that performs some analysis on the data generated by one or more sensors. A processor may also store data… a sensor will not.
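To make the sensor/processor split concrete, here is a toy sketch of a processor for the brake-pad example in the first bullet above; the sensor count and IDs are hypothetical.

```python
# A toy "processor" for the brake-pad thing: dust-sized RFID sensors only
# announce their presence; the processor counts how many still report and
# infers how much of the pad has worn away.
TOTAL_SENSORS = 100_000

def estimate_pad_wear(reporting_sensor_ids: set) -> float:
    """Return the fraction of the brake pad estimated to have worn away."""
    missing = TOTAL_SENSORS - len(reporting_sensor_ids)
    return missing / TOTAL_SENSORS

# If only 80,000 of the 100,000 sensors report in, 20% of the pad is gone.
reports = set(range(80_000))
print(f"Estimated wear: {estimate_pad_wear(reports):.0%}")
```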

Now let’s think about how we might combine sensors and processors in an architecture. To start, let’s consider the context of the data the processor can use for analysis:

  • If the processor has only the last data sensed we would say that the context is immediate and local to one sensor. The processor can see streamed data but can only operate on the last event. We would say that this sensor-processor configuration can provide a simple reflexive response. When you press the lock button in your car a sensor detects this event and signals to all four doors and the boot to lock it up.
  • Another configuration might allow the local processor to store more context from a single sensor over a longer period of time… so the context is historical and local. In the case of the anti-lock brakes… the processor receives signals from a group of sensors and stores a very short historical context. This grouped historical context is very powerful…
  • Another configuration might store the group context and then forward the event details to a bigger server that stores and analyzes a universal context of all things to look for patterns. Further, there could be a hierarchy of groups leading to a universal context.
  • Finally, a server with some group context could summarize the details for that group and pass on only a summary over time up to another group server or to a universal server.

I suspect that you can see where I’m going. There is a trade-off in this picture between the advantages of pushing analytic processing close to the sensor and the associated requirement for more analytic processors, the advantages of intermediate analysis requiring more data movement but fewer analytic processors, and the advantage of a central analytic mother ship where all data is stored and analyzed. In the next version of this thread I’ll try to tease apart the trade-offs.
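As a closing illustration, here is a minimal sketch of the group-context idea from the list above: a group server keeps a short historical context per sensor and forwards only a summary up the hierarchy. The class and field names are invented for this sketch.

```python
# A group server keeps a short per-sensor history (historical, local context)
# and passes only a summary up to the universal server.
from collections import defaultdict, deque
from statistics import mean

class GroupServer:
    def __init__(self, history_len: int = 100):
        self.history = defaultdict(lambda: deque(maxlen=history_len))

    def ingest(self, sensor_id: str, reading: float) -> None:
        self.history[sensor_id].append(reading)

    def summarize(self) -> dict:
        # only this summary moves up the hierarchy... not every raw event
        return {sid: {"n": len(vals), "mean": mean(vals), "max": max(vals)}
                for sid, vals in self.history.items()}

group = GroupServer()
for i in range(10):
    group.ingest("wheel_front_left", 95.0 + i)   # hypothetical wheel-speed data
print(group.summarize())
```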

The Greenplum ORCA Optimizer

In January Greenplum rolled out a new query optimizer. This is very cool and very advanced stuff.

Query optimization is a search problem… in a perfect world you would search through the space of all possible plans for any query and choose the least expensive plan. But the time required to iterate through all possible plans would take more time than most queries… so optimizers use rules to cut down the space searched. The rules have been built up over the years and are designed to prune the space quickly to keep performance high for simple queries. But these rules can break down when complex queries are introduced… so Greenplum made the significant investment to build a new optimizer from scratch.
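To see why the full space cannot be searched, consider just the number of join orders as the table count grows; the short sketch below shows that combinatorial explosion (it ignores join algorithms, indexes, and access paths, which make the real space far larger).

```python
# Counting only left-deep join orderings for a query over n tables.
from math import factorial

def left_deep_join_orders(n_tables: int) -> int:
    return factorial(n_tables)

for n in (3, 5, 10, 15):
    print(f"{n:>2} tables: {left_deep_join_orders(n):,} join orders")
# 15 tables already yields over a trillion orderings... which is why optimizers
# prune the space with rules rather than enumerate it.
```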

Florian Waas, the leader of this program for Greenplum (now off on another venture), explained it to me this way. If the large rectangle in Figure 1 represents the total search space for a query, a modern query optimizer only searches the area in the small gray square… it looks for the best plan in that small space.

You may be surprised to learn that the optimizers used by every major DBMS product are single-threaded… they use only one core of a multi-core processor to search the space and produce a plan. A faster single processor does not help much here: you could search a little more, but the time spent optimizing as a percentage of the query execution time would stay about the same… because query execution would speed up as well. So if the optimizer is to search more of the space it will have to use multiple cores and search the space in parallel… and this is exactly what Greenplum has accomplished.
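For intuition only, here is a minimal sketch of parallel plan search: partition the candidate plans across worker processes, have each return its cheapest plan, and take the overall minimum. The cost function and plan representation are stand-ins, not ORCA’s actual design.

```python
# Parallel search of a plan space: each worker finds the cheapest plan in its
# partition; the cheapest of those winners is the final choice.
from concurrent.futures import ProcessPoolExecutor
from zlib import crc32

def plan_cost(plan: tuple) -> int:
    """Stand-in deterministic cost model for a candidate plan."""
    return sum(crc32(step.encode()) % 1000 for step in plan)

def best_in_partition(plans: list) -> tuple:
    return min(plans, key=plan_cost)

def parallel_best(all_plans: list, workers: int = 4) -> tuple:
    chunks = [all_plans[i::workers] for i in range(workers)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        winners = list(pool.map(best_in_partition, chunks))
    return min(winners, key=plan_cost)

if __name__ == "__main__":
    plans = [("scan A", f"join order {i}", "aggregate") for i in range(1000)]
    print(parallel_best(plans))
```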

The benchmark results for this are impressive (see here)… several queries in the TPC-DS suite run hundreds of times faster.

ORCA is available to early support customers now and the results map to the benchmark… some queries see an extreme performance boost, while others run significantly slower. This is to be expected from any first release optimizer.

But Greenplum has built another advanced technology into ORCA to reduce the time it will take to mature the software. ORCA includes AMPERe, an optimizer debugging facility that captures the state necessary to recreate problems and fix them. Together these capabilities… parallel search and specialized debugging… have advanced the state of the art significantly.

What does it mean to you? It will take some time to shake out ORCA… and HAWQ is still very slow when compared to other analytic databases… and very very slow when compared to the in-memory databases available… and in-memory products like Spark are coming to the Hadoop eco-system. But at the price point HAWQ is a bargain. If you need an inexpensive batch engine that crunches numbers offline then in the next year, as ORCA matures, it may be worth a look.

As a side note… this topic introduces one of the issues related to in-memory databases… when even a very complex query completes with a sub-optimal plan in under a second how much time can you spend searching the plan space? I suspect that applying the parallel optimization principles developed by the Greenplum team will yield similar or even better improvements for in-memory… and these techniques will be a requirement very soon in that space.


Logical Data Warehouses and the Basics of Database Federation

This post will consider the implications of a full database federation as would be required by a Logical Data Warehouse. I’ll build on the concepts introduced in the posts on RDBMS-Hadoop integration (Part 1, Part 2, Part 3, Part 4, Part 5, Part 6, Part 7, Part 8).

Figure 1 summarizes those earlier concepts from simple to advanced.

Figure 1. 2 Tier Federation Maturity

But the full federation required to implement a logical data warehouse requires a significant step up from this. Simple federation will be a disaster and Basic federation will not be much better. Here is why.

Let’s add a database and use Figure 2 to consider the possibilities when we submit a query that joins Table A.One to A.Two to B.One to C.One. Note that in this picture we have included a Governor to execute the federated queries that is independent of any of the DBMSs… this is the usual case for federation.

In the simple case where the Governor executes the entire plan all of the data must come to the Governor. This is clearly unacceptable. Consider the worst case where a SELECT is issued against only one table… still all of the data must bubble up.

In the Basic case the problem is partially mitigated… less data moves after the predicates are resolved but the overhead will still kill query performance. A Governor with basic capabilities provides the minimal features to make this work. It is useful where slow federation is better than data replication… but that is about all.
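To make the simple-versus-basic distinction concrete, here is a toy sketch that contrasts pulling whole tables to the Governor with pushing predicates down so only matching rows move; the Governor class, table names, and row counts are invented.

```python
# Contrast "simple" federation (pull everything up) with "basic" federation
# (push the predicate down so only matching rows move to the Governor).
class Governor:
    def __init__(self, databases):
        self.databases = databases            # {db: {table: [rows]}}
        self.rows_moved = 0

    def fetch_all(self, db, table):
        rows = self.databases[db][table]
        self.rows_moved += len(rows)          # simple: every row moves up
        return rows

    def fetch_where(self, db, table, pred):
        rows = [r for r in self.databases[db][table] if pred(r)]
        self.rows_moved += len(rows)          # basic: only matches move up
        return rows

dbs = {"A": {"One": [{"id": i, "v": i % 5} for i in range(10_000)]},
       "B": {"One": [{"id": i, "v": i % 7} for i in range(10_000)]}}

simple = Governor(dbs)
_ = simple.fetch_all("A", "One") + simple.fetch_all("B", "One")

basic = Governor(dbs)
_ = (basic.fetch_where("A", "One", lambda r: r["v"] == 1)
     + basic.fetch_where("B", "One", lambda r: r["v"] == 1))

print(simple.rows_moved, "rows moved vs", basic.rows_moved)   # 20000 vs ~3429
```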

Figure 2. N-Tier Federation

However, the advanced case becomes seriously more complicated. The optimizer now has to decide whether table B.One should move to C to join the data, or whether it should move to A… or whether data should move to the Governor.

The problem is further complicated by any resource shortage on any node or any functional capability differences. If the cost of data movement suggests moving B data to C… but there is no CPU resource available on C… then maybe a different decision should be made. If C.One is a big table but C is a column store, and the cost of the SELECT is small because only a few columns are required and their cardinality is low enough that the data might be fetched from the dictionary, then we might make a different decision. If B is a fast in-memory database but there is no memory available then the cost changes. Finally, if there are twenty databases in your logical DW then the problem increases exponentially.
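A toy cost model makes the flavor of this decision clearer: pick the join site that minimizes the data shipped, but disqualify nodes with no spare CPU. The table sizes and CPU numbers below are invented for illustration.

```python
# Choose where to run the join: minimize bytes shipped, skip nodes with no
# spare CPU, and fall back to the Governor when no node qualifies.
def choose_join_site(table_sizes_gb: dict, spare_cpu: dict) -> str:
    candidates = []
    for site in table_sizes_gb:
        if spare_cpu.get(site, 0) <= 0:
            continue                                    # no CPU available there
        shipped = sum(sz for node, sz in table_sizes_gb.items() if node != site)
        candidates.append((shipped, site))
    if not candidates:
        return "governor"
    return min(candidates)[1]

sizes = {"A": 500.0, "B": 20.0, "C": 2000.0}            # GB held at each node
cpu = {"A": 0.4, "B": 0.7, "C": 0.0}                    # C has no spare CPU
print(choose_join_site(sizes, cpu))                     # -> "A" in this example
```

Real optimizers must also weigh memory, functional differences, and concurrency… which is exactly why the n-tier problem gets hard so quickly.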

The point here is clear… data federation over n-tiers is a hard problem. There will be severe performance issues when the optimizer picks wrong. This is why the independent governor model is so attractive… many of the variables around CPU resources and database capabilities are removed… and while the performance will be poor it will be predictably poor. You should consider the implications carefully… it is just not clear that a high-performance logical data warehouse is feasible simply laid over an existing architecture. And if you build on a model with a Governor you must be sure that the Governor and the provincial databases can handle the load. I suspect that the Governor will have to run on a cluster and use a shared-nothing architecture to handle a true enterprise-sized logical EDW.

HANA has a twist on this that is interesting. The Governor lives inside one of the database nodes… so for data in HANA there is no data movement cost unless the optimizer decides to send the data to another node. Further, HANA is very fast… and the performance will mitigate some of the slowness inherent in federation. Finally, HANA is a shared-nothing DBMS… so it is not a problem to move lots of data to HANA in support of big tables and/or thousands of concurrent queries.

I’ll try to use this thinking… simple, basic, and advanced federation over some governed federator on an in-memory or fast shared-nothing architecture… to evaluate the products on the market that provide federation. This may prove interesting as the Logical Data Warehouse concept catches on and as products like Teradata’s QueryGrid come to market.

Netezza Zone Maps and I/O Avoidance

A reader recently wrote to me and asked about Netezza: “why does everyone insist that these (zone maps) tell you where ‘not to look’ when hunting for data?”. I’ll provide a direct answer… and a more meaningful answer.

Imagine that you have a list of data blocks with some metadata for each block that tells you the range of data in each block for a given column, FOO, as follows:

  • Block 1: FOO 0-173
  • Block 2: FOO 174-323
  • Block 3: FOO 324-500

and a query that selects WHERE FOO=42.

If Netezza scanned the metadata and sent its read routine the list of blocks to not read… it would send 2,3. This is clearly not the case… if there are a million blocks it would not send a list of 999999 block numbers to not read… and force the read routine to figure out what was left to read. So clearly Netezza does not really tell you where ‘not to look’. This is a clever turn-of-phrase.
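Here is a minimal sketch of what the metadata scan actually does with the blocks above: keep a (min, max) range per block and read only the blocks whose range could contain the predicate value.

```python
# Zone-map style block elimination for WHERE FOO = 42.
zone_map = {1: (0, 173), 2: (174, 323), 3: (324, 500)}   # block -> (min, max)

def blocks_to_read(zmap: dict, value: int) -> list:
    return [blk for blk, (lo, hi) in zmap.items() if lo <= value <= hi]

print(blocks_to_read(zone_map, 42))   # -> [1]; blocks 2 and 3 are never read
```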

But I like this particular cleverness. Every DBMS is built with features designed to avoid I/O:

  • indexes are metadata that point to blocks to reduce reading unnecessary blocks;
  • partitions contain metadata to reduce reading unnecessary partitions;
  • column stores are designed to reduce reads of unnecessary columns; and
  • caches hold hot data blocks to avoid re-reading those blocks.

In fact, the highest performance DBMS will almost always be the one that most effectively minimizes I/O. This is why in-memory databases always have the highest performance.

So while zone maps do not really tell the system directly what not to read… the effect is ‘not to look’ at unnecessary data.

Part 8 – How Hadooped is SQL Server PDW with Polybase?

Now for SQL Server… continuing the thread on RDBMS-Hadoop integration (Part 1, Part 2, Part 3, Part 4, Part 5, Part 6, Part 7) I have suggested that we could evaluate integration architecture using three criteria:

  1. How parallel are the pipes to move data between the RDBMS and the parallel file system;
  2. Is there intelligence to push down predicates; and
  3. Is there more intelligence to push down joins and other relational operators?

Before we start I will suggest a fourth criterion that will be more fully explored later when we consider networks and pipes… that is: how is data sharded/hashed/distributed as it moves from the distribution scheme in HDFS to an optimal, usually hashed, scheme in the target RDBMS. Consider Greenplum as an example… they move data in parallel as quickly as possible to the GPDB and then redistribute the data across GPDB segment nodes using scatter-gather, a very efficient distribution mechanism. We will consider how PDW Polybase manages this as part of our first criterion.

Also note… since I started this series Teradata has come out with a new capability: the QueryGrid. I will add a post to consider this separately… and in this note I will assume the older Teradata capability. This is a little unfair to Teradata and I apologize for that… but otherwise this post becomes too complex. I’ll make things right for Teradata ASAP.

Now on to Microsoft…

First, Polybase has effective parallel pipes to move data from HDFS to the parallel SQL Server instances in PDW. This matches the best capability of other products like Teradata and Greenplum in this category. But where Teradata and Greenplum move data and then redistribute it, pushing the data over a network twice, Polybase has pushed the PDW hash function down to the HDFS node so that data is distributed as it is sent. This very nice feature skips one full move of the data.
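The idea is easy to sketch: hash the distribution key at the source so each row is routed straight to its target node and never needs to be redistributed. The node count and hash function below are illustrative, not PDW’s actual scheme.

```python
# Route rows to their target nodes as they are read from HDFS.
from zlib import crc32

NUM_TARGET_NODES = 8

def target_node(distribution_key: str) -> int:
    return crc32(distribution_key.encode()) % NUM_TARGET_NODES

def route_rows(rows: list, key_column: str) -> dict:
    buckets = {n: [] for n in range(NUM_TARGET_NODES)}
    for row in rows:
        buckets[target_node(str(row[key_column]))].append(row)
    return buckets

sample = [{"cust_id": i, "amt": i * 1.5} for i in range(20)]
print({n: len(rs) for n, rs in route_rows(sample, "cust_id").items()})
```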

Our second criterion considers how smart the connector is in pushing down filters/predicates. Polybase uses a cost-based approach to determine whether it is less expensive to push predicates down or to move all of the data up to the PDW layer. This is a best-in-class capability.

For the third criterion we ask whether the architecture pushes down advanced functions like joins and aggregates… and whether it minimizes the data pulled up to join by using semi-joins. Polybase again provides strong capabilities here, pushing down joins and aggregates. Polybase does not use semi-joins, so there is room to improve here… but Microsoft clearly has this capability in their roadmap.
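For readers unfamiliar with the semi-join trick, here is a minimal sketch of the idea: ship only the join keys to the remote side first, then pull up just the rows that actually match. The data and field names are invented.

```python
# A semi-join: move join keys down, then pull up only the matching rows.
def semi_join(local_rows: list, remote_rows: list, key: str) -> list:
    local_keys = {r[key] for r in local_rows}        # the small key set ships
    matching_remote = [r for r in remote_rows if r[key] in local_keys]
    by_key = {r[key]: r for r in matching_remote}    # only matches move up
    return [{**l, **by_key[l[key]]} for l in local_rows if l[key] in by_key]

local = [{"id": i, "flag": "x"} for i in (1, 2, 3)]
remote = [{"id": i, "val": i * 10} for i in range(1000)]
print(semi_join(local, remote, "id"))   # three joined rows, not 1000 moved
```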

One final note… Polybase works with PDW but not with other SQL Server products. This limitation may be relevant in many cases.

PDW + Polybase is a strong offering… matching HANA in most aspects with HANA having a slight edge in push-down with semi-joins but with SQL Server matching this with the most sophisticated parallel data distribution capability.


Part 7 – How Hadooped is Greenplum, the Pivotal GPDB?

Now for Greenplum & Hadoop… to continue this thread on RDBMS-Hadoop integration (Part 1, Part 2, Part 3, Part 4, Part 5, Part 6) I have suggested that we could evaluate integration architecture using three criteria:

  1. How parallel are the pipes to move data between the RDBMS and the parallel file system;
  2. Is there intelligence to push down predicates; and
  3. Is there more intelligence to push down joins and other relational operators?

The Greenplum interface is architecturally similar to the Teradata interface described in Part 4. Hadoop files are defined to the DBMS as external tables and there are capable parallel pipes to effectively move data from the HDFS side to GPDB. In addition Greenplum uses their Scatter-Gather method to load data into the GPDB effectively.

There is no ability to push down predicates. When a query executes all of the relevant data is sucked through the parallel pipes into the database segments for processing. This is very inefficient and there is not even the crude capability to push down processing provided by Teradata.

Finally, there is no ability to push down joins or aggregation.

Greenplum’s offering is not very advanced. To perform analytics with Greenplum, data must move between the two storage layers with no intelligence to mitigate the cost.

On to the last post in the series Part 8 on SQL Server and Polybase.

Dynamic Late Binding Schemas on Need

I very much like Curt Monash’s posts on dynamic schemas and schema-on-need… here and here are two examples. They make me think… but I am missing something… I mean that sincerely, not just as a setup for a critical review. Let’s consider how dynamism is implemented here and there… so that I can ask a question of the audience.

First imagine a simple unschema’d row:

Rob KloppDatabase Fog Bloghttp://robklopp.wordpress.com42

A human with some context could see that there is a name string, a title string, a URL string, and an integer string. If you have the right context you would recognize that the integer holds the answer to the question: “What is the meaning of Life, the Universe, and Everything?”… see here… otherwise you are lost as to the meaning.

If you load this row into a relational database you could leave the schema out and load a string of 57 characters… or load the data parsed into a schema with Name, Title, URL, Answer. If you load this row into a key-value store you can load it into an unschema’d row with the Key = Row and the Value equal to the string… or parse the data into four key-value pairs.

In any case you have to parse the data… If you store the data in an unschema’d format you have to parse the data and bind value to keys to columns late… if you store the data parsed then this step is unnecessary. To bind the data late in SQL you might create a view from your program… or more likely you would name the values parsed with SQL string functions. To parse the data into key-value pairs you must do the equivalent. The same logic holds true for more complex parsing. A graph database can store keys, values, and relationships… but these facets have to be known and teased out of the data either early or late. An RDBMS can do the same.
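As a concrete illustration of binding late, here is a minimal sketch that stores the raw string and applies a parse only when named values are needed; the fixed parse below is invented for this particular example row (an RDBMS view or SQL string functions would do the equivalent).

```python
# Bind values to names late: keep the raw row, parse at query time.
import re

raw = "Rob KloppDatabase Fog Bloghttp://robklopp.wordpress.com42"

def bind_late(row: str) -> dict:
    """Pull name, title, URL, and answer out of the raw row when asked."""
    m = re.match(r"^(.+?)(Database Fog Blog)(https?://\D+)(\d+)$", row)
    return dict(zip(("name", "title", "url", "answer"), m.groups()))

print(bind_late(raw))
# {'name': 'Rob Klopp', 'title': 'Database Fog Blog',
#  'url': 'http://robklopp.wordpress.com', 'answer': '42'}
```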

So what is the benefit of a database product that proclaims late binding as an advantage? Is it that late binding is easier to do than in an RDBMS? What am I missing?

Please do not respond with a list of other features provided by NewSQL and NoSQL databases… I understand many of the trade-offs… what I want to know is:

  • What can they do connected to binding values to names that an RDBMS cannot? And if there is no new functionality…
  • Is there some way they allow for binding that is significantly easier?

By the way, the Hitchhiker’s Guide is silent on the question of whether 42 is a constant or ever-changing. I think that I’ll ask Watson.
