Cloud Computing and Data Warehousing: Part 2 – An Elastic Data Warehouse

In Part 1 of this topic (here) I suggested that cloud computing has the ability to be elastic… to expand and maybe contract the infrastructure as CPU, memory, or storage requirements change. I also suggested that the workload on an EDW is intense and static, pointing out that there was no significant advantage to consolidating non-database workloads onto an over-utilized EDW platform.

But EDW workload does flex some with the business cycle… quarter end reporting is additive to the regular daily workload. So maybe an elastic stretch to add resources and then a contraction has value? It most probably does add value.

Shared-nothing works because it builds on a sharded model that splits the data across nodes and lets CPU and I/O bandwidth scale together. This is very important… the limiting factor in these days of multi-core CPUs is I/O bandwidth, and many nodes plus shards provide the aggregate I/O bandwidth of all of the disk controllers in the cluster.
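
The sharded model can be sketched in a few lines of Python. This is a toy illustration only, not any vendor's actual placement scheme… real systems hash a chosen distribution column rather than a row id:

```python
def shard_for(key: int, num_nodes: int) -> int:
    """Map a row key to one of num_nodes shards (toy modulo-hash scheme)."""
    return hash(key) % num_nodes

def place_rows(keys, num_nodes):
    """Distribute rows across shards; each node scans only its own shard."""
    shards = {n: [] for n in range(num_nodes)}
    for k in keys:
        shards[shard_for(k, num_nodes)].append(k)
    return shards

shards = place_rows(range(1000), 4)
# Each of the 4 nodes holds ~250 rows and scans them through its own
# disk controllers, so aggregate I/O bandwidth grows with the node count.
```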

What does that mean with regards to building an elastic data warehouse? It means that with each elastic stretch the data has to be re-deployed across the new number of shards. And because the data to be moved is embedded in blocks it means that the entire warehouse, every block, has to be scanned and re-written. This is an expensive undertaking on disk… one that bottlenecks at the disk controller and one that bottlenecks worse if there are fewer controllers (in a SAN environment, for example). Then, when the configuration shrinks, the process is repeated. In reality the cost of the I/O resources to expand and contract does not justify the benefit.
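
To see why a stretch touches nearly everything, consider a toy modulo placement scheme (an illustration only… real systems may use hash ranges or consistent hashing to soften this, but block-embedded data still has to be re-written):

```python
def moved_fraction(num_keys: int, old_nodes: int, new_nodes: int) -> float:
    """Fraction of rows whose shard assignment changes after a resize."""
    moved = sum(1 for k in range(num_keys)
                if k % old_nodes != k % new_nodes)
    return moved / num_keys

# Stretching from 4 to 5 nodes relocates 80% of the rows under this
# scheme... and since rows are embedded in blocks, effectively every
# block of the warehouse must be read and re-written.
print(moved_fraction(100_000, 4, 5))  # 0.8
```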

So… we conclude that while it is technically possible to build an elastic EDW it is not really optimal. In every case it is feasible to build a cloud-based EDW… it is possible to deploy a shared-nothing architecture, possible to consolidate workloads, and possible to expand and contract… but it is sub-optimal.

The real measure of this is that in no case would a cloud-based EDW proof-of-concept win business over a stand-alone cluster. The price of the cloudy EDW would be 2X for 1/2 the performance… and it is unlikely that the savings associated with cloud computing could make up this difference (the price of SAN is 2X that of JBOD and the aggregate I/O bandwidth is 1/2… for the same number of servers… hence the rough estimates). This is why EMC offers a Data Computing Appliance without a SAN. Further, this 4X advantage assumes that 100% of the SAN-based cluster is dedicated to the EDW. If 50% of the cluster is shared with some other workload then the performance drops by 50%.
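
The rough arithmetic above can be made explicit, using the post's own estimates (SAN at 2X the price of JBOD with 1/2 the aggregate I/O bandwidth, for the same server count):

```python
# Normalized back-of-envelope figures from the estimates in the text.
jbod_price, jbod_perf = 1.0, 1.0
san_price, san_perf = 2.0, 0.5

# Price/performance ratio of JBOD relative to SAN.
advantage = (jbod_perf / jbod_price) / (san_perf / san_price)
print(advantage)  # 4.0 -- the "4X" price/performance gap

# If only 50% of the SAN cluster is dedicated to the EDW, the EDW's
# effective performance halves again and the gap doubles.
shared_advantage = (jbod_perf / jbod_price) / ((san_perf * 0.5) / san_price)
print(shared_advantage)  # 8.0
```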

In the next post (here) I’ll consider Paraccel on the Amazon Cloud.

Live and Learn… The Cost of Industry Standard Data Models

In a previous life I pushed for the development of industry data models at my employer… a large data warehousing company. The company was not interested so I left, started running a consulting firm, built a comprehensive Telco Data Model, and sold it to my ex-employer. It was the first of what is now a suite of industry-standard models offered by the firm.

Over time I have become less enamoured of these industry models. First, 15 years ago, I became concerned that too many IT Enterprise Data Modelers were trying to force their companies to all speak the same enterprise language. I spoke at conferences suggesting that an Enterprise Model was a canonical model… a Rosetta stone… that let all of the specific dialects of the enterprise: marketing-speak, finance-speak, and manufacturing-speak, be integrated and translated. It was not meant to be spoken. I said that the Enterprise Data Model was like Esperanto… the best language ever invented… but that no one needed to speak it, as English, Urdu, and Cantonese work just fine.

But as time has gone on the dialects in any given Enterprise have lost some of their uniqueness and there is a larger common dictionary available. This is in part the result of data warehousing… which provides a common tongue. And in part the result of MDM, which seeks to define a dictionary and grammar for the Enterprise. But I think that it is the widespread use of a single relational paradigm, more than anything else, that provided a relational grammar to standardize how we describe Enterprise data. The result is that each company has formed its own unique Enterprise language and the departmental dialects have softened.

So we were both a little right and a little wrong 15 years ago. The Enterprise modellers were right to believe that a universal Enterprise language would be a good thing. I was maybe correct in believing that no language should be imposed.

What has this to do with standard data models? First, a standard model imposes an outside language on your corporation. The nouns and verbs map to the business… but the sound feels a little foreign and there is a learning curve to get there… it has the uncomfortable feel of being a Bostonian in Atlanta or a Yankee in Australia… each speaks English but there is something different going on that requires extra thought and concentration.

But the real issue is the hidden cost to your BI eco-system. When you ETL data from your source systems… systems in your unique Enterprise dialect… into the “standard” dialect more than a little effort is required.

I recently participated in a discussion with a company that had just requested bids for an end-to-end data warehouse/business intelligence re-development. They said that the bids for services (no hardware or software) bunched into three categories based on the fixed-price bid.

At one end were two companies who offered industry-standard data models. You could probably guess who they are. In the middle was a company that offered two pre-defined data mart data models in support of a very specific application requirement. At the other end were three companies who offered no data models… but would build them as part of the service. The bids within each group were remarkably similar.

Who was the least expensive? The firms who bid no data model. Their bids came in at about half the cost of the bids from the firms who offered industry-standard data models.

What would explain this? It is the ETL costs associated with mapping and translating the unique Enterprise language of this company into the offered standard that make the difference. Fitting your company’s data dictionary into the dictionary of another dialect within the same industry is an expensive proposition… it cost twice as much to develop into a standard data model… and that is besides the human cost of having to learn the new dialect.
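
As a hypothetical illustration of that translation tax, every extract must pass through a mapping layer something like this (the column names below are invented for the sketch, not taken from any actual standard model):

```python
# Invented mapping from one company's dialect to a "standard" dialect.
enterprise_to_standard = {
    "cust_nbr":   "PARTY_ID",
    "svc_line":   "PRODUCT_OFFERING_ID",
    "bill_cycle": "BILLING_PERIOD_CD",
}

def translate(row: dict) -> dict:
    """Rename each source column into the standard model's vocabulary."""
    return {enterprise_to_standard.get(col, col): val
            for col, val in row.items()}

print(translate({"cust_nbr": 42, "svc_line": "DSL"}))
# {'PARTY_ID': 42, 'PRODUCT_OFFERING_ID': 'DSL'}
```

Every feed pays this mapping step in development and maintenance… and the analysts still have to learn the standard names on top of the familiar ones.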

Sometimes my current employer is under-rated due to the lack of industry-standard models. But I’m no longer so sure that these are a good thing. Now may be the time to rethink.

I would like to acknowledge the article by Margy Ross at Kimball University (see http://www.informationweek.com/news/227400287) for starting my personal rethink on this topic.

The Worst Data Warehouse in the World

So far this blog has focused on issues related to database architecture… so this title might not seem on message. But architecture has implications.

The aim of any BI system is to support the decision-making process of the business. BI infrastructure is clearly a success when your company learns to make fact-based decisions as part of the day-to-day operation of the business. The best data warehouse in the world would be one that provides such effective decision support that the business gains an advantage over the competition.

But I often run into companies where sweet success has turned sour. Why? Because in these sour situations the BI eco-system cannot keep up. In these bad cases the best data warehouse in the world becomes the worst.

Usually the problem comes in one of two flavors: either the required decision support is unavailable in time to make a decision, or the eco-system cannot extend to support new business opportunities.

The first case usually shows up during periods when decision-making increases: during seasonal peaks in business. The second appears when the business grows: after a merger or when a new product is introduced. In both cases the cost of the failure is significant.

But these worst cases do not happen out of the blue. They creep up on you. There are symptoms. Often the first symptom appears when the nightly reporting process starts missing its service-level targets. That is, the nightly load of the warehouse, the refresh of the indexes, materialized views, summary tables, cubes, and marts, and then the running of reports cannot complete in the batch window. This is followed by slow response in your online query processing as the nightly process creeps into the day. Then the business asks for more users and/or for more data to be added and the problem grows… until decision-making is delayed or unsupported altogether.
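
The symptom is easy to sketch: sum the nightly task durations against the batch window. The hours below are invented for illustration:

```python
# Hypothetical nightly task durations (hours) for an overloaded EDW.
nightly_tasks_hours = {
    "warehouse_load":   3.0,
    "index_refresh":    1.5,
    "mv_refresh":       1.0,
    "summary_tables":   1.0,
    "cubes_and_marts":  1.5,
    "report_runs":      1.0,
}
batch_window_hours = 8.0

total = sum(nightly_tasks_hours.values())
print(total)                       # 9.0
print(total > batch_window_hours)  # True: the batch window is blown,
                                   # and the overrun creeps into the day
```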

Sadly, this problem is avoidable and the solution is well understood. All that is required is a scalable foundation that can extend through the addition of relatively inexpensive hardware. If you can easily add storage and compute, then as the constraints hit you can scale up.

A shared nothing architecture scales. We have examples at Greenplum of production systems that scale from hundreds of gigabytes to thousands of terabytes… and other shared nothing vendors, Teradata and Netezza at least, can boast the same. When our customers run out of gas we add hardware. And the architecture scales bigger still… shared nothing is the foundation for all web-scale database technology… scaling to hundreds of petabytes.

So why do companies build, and continue to build, on shared memory systems with built-in limits? Because… they continually underestimate the growth in data… the failure is a failure of vision (consider the name “Teradata”… selected when a terabyte was considered nearly unreachable). Data does not just grow, it explodes in leaps and bounds as technology advances.

But let’s be real… Why do companies really select limiting infrastructure? Because they mistakenly believe that they can build BI infrastructure on technology designed for OLTP… and they already have DBAs trained on this technology who heavily influence the decision. Or, they have an enterprise license for the OLTP database and they want to save some money.

I imagine that I’ve made my point. The worst data warehouse in the world is a warehouse that constrains your business… one that cannot scale as the demand for data and decision support grows… one that costs you hundreds of thousands of dollars in staff time with every change… one that is tuned to the breaking point, rather than robust.

Why would anyone ever put their business at risk like this?

Co-processors and database machines: Intel, Netezza, and Teradata

Intel recently announced a change in direction regarding the use of special hardware co-processing to support RAID. This change is relevant to vendors of data warehouse appliances like Netezza, and to a smaller extent, Teradata, who both use co-processors to offload processing. A report on the Intel announcement is here: http://www.theregister.co.uk/2010/09/15/sas_in_patsburg/. In short, Intel says that because multi-cores provide so much compute for so little cost it just does not make sense to build a special co-processor to support the RAID function.

What, you may ask, does this have to do with Netezza and Teradata? The answer is that both companies have built database machines that, to a greater or lesser degree, use co-processors. And Intel’s conclusion that multi-cores obsolete co-processors significantly affects their value proposition to the market.

In the case of Netezza the significance is considerable. Netezza conceived its FPGA-based architecture when compute was expensive relative to I/O. An FPGA provided a way to add compute power at a low cost point. But according to Intel this changed as multi-core technology matured. Further, as the Intel processor line adds even more cores, the value of an FPGA co-processor becomes less still.

In the case of Teradata the issue is more manageable. The original Teradata Y-Net and their current BYNET interconnect offloads the merge process from the CPU to the interconnect. Back when Teradata ran on 8086 cores from Intel this was significant and it formed a cornerstone of Teradata’s value proposition. Today, the value of BYNET is diminished and one could imagine Teradata dropping it altogether sometime in the future for a less proprietary network fabric.
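
The merge that BYNET offloads is, in essence, an N-way merge of pre-sorted per-node result streams… exactly the kind of work a modern multi-core CPU handles cheaply, as this sketch suggests:

```python
import heapq

def merge_sorted_streams(streams):
    """Merge pre-sorted per-node result streams into one global order."""
    return list(heapq.merge(*streams))

# Three nodes each return their shard of the answer already sorted;
# a single cheap merge pass produces the final ordering.
node_results = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
print(merge_sorted_streams(node_results))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```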

Please do not think that I am suggesting that multi-cores make either Teradata or Netezza obsolete. They are both formidable technologies. But multi-cores will reduce the price/performance advantage that co-processing once afforded, in favor of database servers that do not use co-processing, like Exadata, Greenplum, and DB2.
