Buzzwords – Page 9 – Database Fog Blog

HANA and ABAP

One more surprise…

In the past SAP applications have, in general, avoided using database features. Even a SELECT with a projection was out-of-bounds. They did not want to depend on any database, so they tended to pull all data from the data layer to the application layer and loop through the data using procedural languages like ABAP. You might say that they were religiously database agnostic. My mistake… you might say that we were religiously database agnostic. I have to get used to these new surroundings.

Besides the obvious attributes of HANA: in-memory, shared-nothing, MPP, and column-oriented… the aim is to move the application logic next to the data and into HANA.

Any of you who have labored to convert procedural code into set-based SQL will understand the issue here. There are hundreds of thousands or millions of lines of procedural code… often very simple loops… that have to be converted to SQL to make the HANA architecture support the SAP application portfolio.

The surprise is not that there is this outstanding issue.. nor is it the ambitious architecture designed to push the application deep into the database (we are not talking about SQL-based stored procedures… we are talking about the application). The surprise is that the HANA development team has built a state-of-the-art facility that programmatically converts procedural logic into its set-based equivalent (not necessarily into SQL but sometimes into a language that can execute in-parallel). This is not a tool requiring manual intervention… it is an automatic, mathematically provable, transformation.

Right now the technique is used to covert logic in stored-procedures and in ABAP. But I hope to see it applied in the optimizer to convert those ugly Oracle cursor loops on-the-fly.

You can read more here.

By the way… SAP will continue to support ABAP using the database as a file server… moving all of the data from the database server to the application server for processing. But you can imagine that… when running applications that use this powerful capability… over time HANA will emerge with a huge performance advantage over other databases…

Oracle should be worried.

Numbers Everyone Should Know

Some of you have seen me build simple models to do a reality-check on architecture (see here, for example). Here are some metrics from a great presentation by Jeff Dean, a Google fellow.

Numbers Everyone Should Know

L1 cache reference	0.5 ns
Branch mispredict	5 ns
L2 cache reference	7 ns
Mutex lock/unlock	25 ns
Main memory reference	100 ns
Compress 1K bytes with Zippy	3,000 ns
Send 2K bytes over 1 Gbps network	20,000 ns
Read 1 MB sequentially from memory	250,000 ns
Round trip within same datacenter	500,000 ns
Disk seek	10,000,000 ns
Read 1 MB sequentially from disk	20,000,000 ns
Send packet CA->Netherlands->CA	150,000,000 ns

Of note is the 120X difference between the cost of reading 1MB from memory and the cost of reading 1MB from disk…

The entire presentation may be found here: http://www.odbms.org/download/dean-keynote-ladis2009.pdf

Happy Modeling…

Who is Massively Parallel? HANA vs. Teradata and (maybe) Oracle

I have promised not to promote HANA heavily on this site… and I will keep that promise. But I want to share something with you about the HANA architecture that is not part of the normal marketing in-memory database (IMDB) message: HANA is parallel from its foundation.

What I mean by that is that when a query is executed in-memory HANA dynamically shards the data in-memory and lets each core start a thread to work on its shard.

Other shared-nothing implementations like Teradata and Greenplum, which are not built on a native parallel architecture, start multiple instances of the database to take advantage of multiple cores. If they can start an instance-per-core then they approximate the parallelism of a native implementation… at the cost of inter-instance communication. Oracle, to my knowledge, does not parallelize steps within a single instance… I could be wrong there so I’ll ask my readers to help?

As you would expect, for analytics and complex queries this architecture provides a distinct advantage. HANA customers are optimizing price models sub-second in-real-time with each quote instead of executing a once-a-week 12-hour modeling job.

June 11, 2013: You can find a more complete and up-to-date discussion of this topic here… – Rob

As you would expect HANA cannot yet stretch into the petabyte range. The current HANA sweet spot is for warehouses or marts is in the sub-TB to 20TB range.

Real-time Analytics and BI: Part 1 – Singing for my Dinner

Several months ago I was invited to a dinner attached to a data science summit… with the price being that I had to deliver a 5 minute talk… I had to sing for my dinner. The result was this thinking on real-time analytics and the Toyota Prius.

Real-time analytics implies two things:

Changes in the data are evaluated continuously; and
The results of the analysis are used or displayed continuously.

In a Toyota Prius we can see two examples of real-time analytics.

The first is in the anti-lock braking system. There data reflecting the pressure on the brake pedal and on rotation of each wheel is sent to a computer that analyzes the results and adjusts the brake pressure on each wheel so that all four wheels turn at the same rate and the car stops in a straight line.

Note that the analytics are real-time and the results are used immediately without human intervention. This is important. It makes little sense to spend the money to capture and analyze data in real-time if the results are not actionable in near-real-time.

Think for a moment about the BI systems built over the last 20 years. First we captured and analyzed monthly data… and acted on that data within a 30-day window. Then we increased the granularity of the data to weekly and slightly adjusted the reports to reflect the finer granularity… and acted on the data within 7 days. Then we adjusted the data to daily and acted on the results each day. Then we adjusted the data to hourly and reacted even more quickly. These changes often did not fundamentally change the business processes driven by the data… they just made the processes more sensitive to the fine-grained information.

But if the data-driven business process takes ten minutes to complete… for example it takes ten minutes for staff to pick inventory, package the results, and load a delivery truck; could there be a return on the investment expense of developing a continuous, real-time analytic? I think not. There may, however, be ROI associated with a new robotic pick, package, and load process…

There is another possibility… If sometimes the pick, package, and load takes ten minutes and sometimes it takes fifteen minutes then the best solution is to perform the analytics on the current state on-demand… when there are resources to support the process. This maximizes the use of the resources without changing the business process.

The point here is that real-time requires a re-think… or at least a deep-think. The business process may have to change significantly to support real-time analytics.

The second real-time system in the Prius illustrates the problem. On the dashboard the Prius displays, in real-time, the state of the hybrid gas-electric system. It shows whether the battery is charging or discharging… it shows whether the car is being driven using the electric or the internal-combustion engine. It is one of the most beautiful dashboard displays you have ever seen… and executives everywhere must look at it and wonder why they cannot get such a beautiful display of the state of their business… after-all… BI dashboards are “the thing”.

But the Prius display is useless. There is no action you would take while driving based on this real-time display.From a decision-making view it represents useless and expensive flash (that helps to sell the Prius…).

So… approach real-time analytics with a deep-think. Look for opportunities like the anti-lock braking system where real-time analytics can be embedded into automatic business processes. Avoid flashy dashboards that do not present actionable data.

In-memory databases (IMDB) such as SAP HANA, Oracle TimesTen, and VMWare SQLFire promise to enable real-time analytics… and this promise is real… the opportunities can and will revolutionize the enterprise over time… but a revolution is not the same old BI at a finer granularity… it is much more significant than that. Heads will roll.

Cloud Computing and Data Warehousing: Part 4 – IMDB Data Warehouse in a Cloud

In the previous blogs on this topic (Part 1, Part 2, Part 3) I suggested that:

Shared-nothing is required for an EDW,
An EDW is not usually under-utilized,
There are difficulties in re-distributing sharded, shared-nothing data to provide elasticity, and
A SAN cannot provide the same IO bandwidth per server as JBOD… nor hit the same price/performance targets.

Note that these issues are tied together. We might be able to spread the EDW workload over so many shards and so many SANs that the amount of I/O bandwidth per GB of EDW data is equal to or greater than that provided on a DW Appliance. This introduces other problems as there are typically overhead issues with a great many nodes. But it could work.

But what if we changed the architecture so that I/O was not the bottleneck? What if we built a cloud-based shared-nothing in-memory database (IMDB)? Now the data could live on SAN as it would only be read at start-up and written at shut-down… so the issues with the disk subsystem disappear… and issues around sharing the SAN disappear. Further, elasticity becomes feasible. With an IMDB we can add and delete nodes and re-distribute data without disk I/O… in fact it is likely that a column store IMDB could move column-compressed data without re-building rows. IMDB changes the game by removing the expense associated with disk I/O.

There is evidence emerging that IMDB technology is going to change the playing field (see here).

Right now there are only a few IMDB products ready in the market:

TimeTen: which is not shared-nothing scalable, nor columnar, but could be the platform for a very small, 400GB or less (see here), cloud-based EDW;
SQLFire: which is semi-shared-nothing scalable (no joins across shards), not columnar, but could be the platform for a larger, maybe 5TB, specialized EDW;
ParAccel: which is shared-nothing scalable, columnar, but not fully an IMDB… but could be (see C. Monash here); or
SAP HANA: which is shared-nothing, IMDB, columnar and scalable to 100TB (see here).

So it is early… but soon enough we should see real EDWs in the cloud and likely on Amazon EC2, based on in-memory database technologies.

Cloud Computing and Data Warehousing: Part 3 – ParAccel on EC2

In the previous post here I suggested that a SAN-based, cloudy, EDW is about 4X the cost for the same performance over a data warehouse appliance.. and I described why. I have actually seen this comparison.

It is difficult to compare Amazon EC2 hardware to the hardware typically assembled in a shared-nothing EDW cluster whether the hardware is from HP, Dell, Sun, IBM, or Teradata. So let’s assume that Amazon gets a 20% edge due to huge volume purchases over your firm. Note that this is a significant edge since the hardware is a commodity. Further, lets assume that Amazon gets another 30% edge in TCO on system administration costs. This is the cost of staff to manage the Linux OS and the hardware components. This may also be generous to the Amazon side of the equation. The numbers are not important… you can put in whatever seems to model your situation best… if you work for a large efficient company the numbers may go down for EC2.

Lets also assume that you reserve and receive dedicated hardware on EC2. This will not be the case but lets continue to build a best-case scenario for EC2.

From these numbers we can assume that the EC2 configuration will be 3X the cost for the same performance as a dedicated purpose-built database cluster. Again this assumes that the EC2 hardware is dedicated so this number is optimistic.

So why would anyone do this? Because EC2 has no up-front capital expense associated… it is an operating expense. This is significant.

So what is the advantage of buying ParAccel on EC2? I’m unsure. ParAccel has not done particularly well in the marketplace… but it is not clear that this is a technology issue. The answer could lie in the fact that companies deploy ParAccel on EC2 for data mart or application-specific workloads that may not use 100% of the hardware resources provided?

I think that if you work through these three blogs you can get an idea of how to model the opportunity for yourself. If the ability to spend OPEX dollars with Amazon is important… even if you need 3X the hardware… then this is a very interesting way to go.

But do not imagine that you are getting the same performance with ParAccel on EC2 that you wold get with ParAccel on HP or Dell… for a fraction of the price. There is no architectural advantage in ParAccel on EC2 over Vertica or Greenplum or any other DBMS that can run on EC2… ParAccel is, however, trying something new and interesting… if you understand the trade-offs.

In the last blog of this series (here) I’ll discuss some new approaches that may change the game… including another interesting possibility for ParAccel going forward.

Cloud Computing and Data Warehousing: Part 2 – An Elastic Data Warehouse

In Part 1 of this topic (here) I suggested that cloud computing has the ability to be elastic… to expand and maybe contract the infrastructure as CPU, memory, or storage requirements change. I also suggested that the workload on an EDW is intense and static to point out that there was no significant advantage to consolidating non-database workloads onto an over utilized EDW platform.

But EDW workload does flex some with the business cycle… quarter end reporting is additive to the regular daily workload. So maybe an elastic stretch to add resources and then a contraction has value? It most probably does add value.

The reason shared-nothing works is because it builds on a sharded model that splits the data across nodes and lets the CPU and I/O bandwidth scale together. This is very important… the limiting factor in these days of multi-core CPUs is I/O bandwidth and many nodes plus shards provides the aggregate I/O bandwidth of all disk controllers in the cluster.

What does that mean with regards to building an elastic data warehouse? It means that with each elastic stretch the data has to be re-deployed across the new number of shards. And because the data to be moved is embedded in blocks it means that the entire warehouse, every block, has to be scanned and re-written. This is an expensive undertaking on disk… one that bottlenecks at the disk controller and one that bottlenecks worse if there are fewer controllers (for example in in a SAN environment). Then, when the configuration is to shrink it process is repeated. In reality the cost of th I/Oe resources to expand and contract does not justify the benefit.

So… we conclude that while it is technically possible to build an elastic EDW it is not really optimal. In every case it is feasible to build a cloud-based EDW… it is possible to deploy a shared-nothing architecture, possible to consolidate workloads, and possible to expand and contract… but it is sub-optimal.

The real measure of this is that in no case would a cloud-based EDW proof-of-concept win business over a stand-alone cluster. The price of the cloudy EDW would be 2X for 1/2 the performance… and it is unlikely that the savings associated with cloud computing could make up this difference (the price of SAN is 2X that of JBOD and the aggregate I/O bandwidth is 1/2… for the same number of servers… hence the rough estimates). This is why EMC offers a Data Computing Appliance without a SAN. Further, this 4X advantage assumes that 100% of the SAN-based cluster is dedicated to the EDW. If 50% of the cluster is shared with some other workload then the performance drops by that 50%.

In the next post (here) I’ll consider Paraccel on the Amazon Cloud…

A Big Data Sound Bite…

Here is a sound bite on Big Data I composed for another source…

Big Data is relative. For some firms Big Data will be measured in petabytes and for other in hundreds of gigabytes. The point is that very detailed data provides the vital statistics that quantify the health of your business.

To store and access Big Data you need to build on a scalable platform that can grow. To process Big Data you need a fully scalable parallel computing environment.

With the necessary infrastructure in place the challenge becomes: how do you gauge your business and how do you change the decision-making processes to use the gauges?

More on Big Data… and on Big Data Analytics… and on a definition of a Big Data Store…

After a little more thinking I’m not sure that Big Data is a new thing… rather it is a trend that has “crossed the chasm” and moved into the mainstream. Call Detail records are Big Data and they are hardly new. In the note below I will suggest that, contrary to the long-standing Teradata creed, Big Data is not Enterprise Data Warehouse (EDW) data. It belongs in a new class of warehouse to be defined…

The phrase “Big Data” refers to a class of data that comes in large volumes and is not usually joined directly with your Enterprise Data Warehouse data… even if it is stored on the same platform. It is very detailed data that must be aggregated and summarized and analyzed to meaningfully fit into an EDW. It may sit adjacent to the EDW in a specialized platform tailored to large-scale data processing problems.

Big Data may be data structured in fields or columns, semi-structured data that is de-normalized and un-parsed, or unstructured data such as text, sound, photographs, or video.

The machinery that drives your enterprise, either software or hardware, is the source of big Data. It is operational data at the lowest level.

Your operations staff may require access to the detail, but at this granular level the data has a short shelf life… so it is often a requirement to provide near-real-time access to Big Data.

Because of the volume and low granularity of the data the business usually needs to use it in a summarized form. These summaries can be aggregates or they can be the result statistical summarization. These statistical summaries are the result of Big Data analytics. This is a key concept.

Before this data can be summarized it has to be collected… which requires the ability to load large volumes of data within business service levels. The Big Data requires data quality control at scale.

You may recognize these characteristics as EDW requirements; but where an EDW requires support for a heterogeneous environment with thousands of data subject areas and thousands and thousands of different queries that cut across the data in an ever-increasing number of paths, a Big Data store supports billions of homogeneous records in a single subject area with a finite number of specialized operations. This is the nature of an operational system.

In fact, a Big Data store is really an Operational Data Store (ODS)… with a twist. In order to evaluate changes over time the ODS must store a deep history of the details. The result is a Big Data Warehouse… or an Operational Big Data Store.

What is Big Data? No kidding this time…

I posted a little joke on this topic here… this time I’ll try to say some a little more substantive…

Big Data is the new, new, thing. The phrase is everywhere. A Google search on the exact words “Big Data” updated in the last year yields 39,300,000 results. The Wikipedia entry for Big Data suggests that big data is related to data volumes that are difficult to process. There is specific mention of data volumes that are beyond the ability to process easily with relational technology. Examples are typically listed of weblog data and sensor data.

I am not a fan of the if-its-so-big-its-difficult-to-handle line of thinking. This definition lets anyone and everyone claim to process Big Data. Even the Wikipedia article suggests that for small enterprises “Big Data” could be under a Terabyte.

Nor I am a fan of the anti-relational approach. I have seen Greenplum relational technology solve 7000TB weblog queries on a fraction of the hardware required by Big Data alternatives like Hadoop in a fraction of the processing time. If relational can handle 7PB+ then Big Data means web-scale size… 1000’s of petabytes and only Google-sized companies can contain it. Big Data seems smaller than that.

Maybe the answer lies in focusing on the “new” part? An Enterprise Data Warehouse (EDW) can be smallish or large… but there are new data subject areas in the Big Data examples that may not be appropriate for an EDW. Sensor data might not be usefully joined to more than a few dimensions from the EDW… so maybe it does not make sense to store it in the same infrastructure? The same goes for click-stream and syslog data… and maybe for call detail records and smart meter reads in telcos and utilities?

So Big Data is associated with new subject areas not conventionally stored in an EDW… big enough… and made up of atomic data such that there is little business value in placing it in the EDW. Big Data can stand alone… value derived from it may be added to the EDW. Deriving that value come from another new buzzword: Big Data Analytics… surely the topic of another note…

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this: