HANA and ABAP

 

One more surprise…

In the past SAP applications have, in general, avoided using database features. Even a SELECT with a projection was out-of-bounds. They did not want to depend on any database, so they tended to pull all data from the data layer to the application layer and loop through the data using procedural languages like ABAP. You might say that they were religiously database agnostic. My mistake… you might say that we were religiously database agnostic. I have to get used to these new surroundings.

Besides the obvious attributes of HANA: in-memory, shared-nothing, MPP, and column-oriented… the aim is to move the application logic next to the data and into HANA.

Any of you who have labored to convert procedural code into set-based SQL will understand the issue here. There are hundreds of thousands or millions of lines of procedural code… often very simple loops… that have to be converted to SQL to make the HANA architecture support the SAP application portfolio.

The surprise is not that there is this outstanding issue.. nor is it the ambitious architecture designed to push the application deep into the database (we are not talking about SQL-based stored procedures… we are talking about the application). The surprise is that the HANA development team has built a state-of-the-art facility that programmatically converts procedural logic into its set-based equivalent (not necessarily into SQL but sometimes into a language that can execute in-parallel). This is not a tool requiring manual intervention… it is an automatic, mathematically provable, transformation.

Right now the technique is used to covert logic in stored-procedures and in ABAP. But I hope to see it applied in the optimizer to convert those ugly Oracle cursor loops on-the-fly.

You can read more here.

By the way… SAP will continue to support ABAP using the database as a file server… moving all of the data from the database server to the application server for processing. But you can imagine that… when running applications that use this powerful capability… over time HANA will emerge with a huge performance advantage over other databases…

Oracle should be worried.

 

Numbers Everyone Should Know

Some of you have seen me build simple models to do a reality-check on architecture (see here, for example). Here are some metrics from a great presentation by Jeff Dean, a Google fellow.

Numbers Everyone Should Know

L1 cache reference 0.5 ns
Branch mispredict 5 ns
L2 cache reference 7 ns
Mutex lock/unlock 25 ns
Main memory reference 100 ns
Compress 1K bytes with Zippy 3,000 ns
Send 2K bytes over 1 Gbps network 20,000 ns
Read 1 MB sequentially from memory 250,000 ns
Round trip within same datacenter 500,000 ns
Disk seek 10,000,000 ns
Read 1 MB sequentially from disk 20,000,000 ns
Send packet CA->Netherlands->CA 150,000,000 ns

Of note is the 120X difference between the cost of reading 1MB from memory and the cost of reading 1MB from disk…

The entire presentation may be found here: http://www.odbms.org/download/dean-keynote-ladis2009.pdf

Happy Modeling…

Who is Massively Parallel? HANA vs. Teradata and (maybe) Oracle

I have promised not to promote HANA heavily on this site… and I will keep that promise. But I want to share something with you about the HANA architecture that is not part of the normal marketing in-memory database (IMDB) message: HANA is parallel from its foundation.

What I mean by that is that when a query is executed in-memory HANA dynamically shards the data in-memory and lets each core start a thread to work on its shard.

Other shared-nothing implementations like Teradata and Greenplum, which are not built on a native parallel architecture, start multiple instances of the database to take advantage of multiple cores. If they can start an instance-per-core then they approximate the parallelism of a native implementation… at the cost of inter-instance communication. Oracle, to my knowledge, does not parallelize steps within a single instance… I could be wrong there so I’ll ask my readers to help?

As you would expect, for analytics and complex queries this architecture provides a distinct advantage. HANA customers are optimizing price models sub-second in-real-time with each quote instead of executing a once-a-week 12-hour modeling job.

June 11, 2013: You can find a more complete and up-to-date discussion of this topic here… – Rob

As you would expect HANA cannot yet stretch into the petabyte range. The current HANA sweet spot is for warehouses or marts is in the sub-TB to 20TB range.

Cloud Computing and Data Warehousing: Part 1 – The Architectural Issues

My apologies… I was playing with the iPad version of WordPress and accidentally published a very rough outline/first draft of this post. I immediately un-published it… but not before subscribers were notified that there was a new post.

I wonder about the idea that data warehousing is suited to operate in the cloud? This was prompted by Paraccel‘s venture to deploy on the Amazon EC2 cloud infrastructure. Lets work through the architectural implications…

Here are the assumptions I’ll take into this exploration:

  1. A shared-nothing architecture is required to scale.
  2. Cloud infrastructure is cost-effective when the infrastructure is under-utilized and workloads can be consolidated to achieve full utilization… and not so cost-effective when the infrastructure is highly utilized. This is because applications can easily share underutilized resources in the Cloud.
  3. Cloud infrastructure is justified when the workload is inconsistent and either CPU or storage requirements fluctuate widely over the business cycle. This is because a Cloud is elastic and can easily flex as the requirements fluctuate. Cloud computing may not be well suited to static workload requirements.

You can probably see where I’m going with this from the assumptions.

In the end I’ll suggest that there is a database architecture that is suited to warehousing and cloud computing… but let me build to that.

Before I start let me also be clear that I am talking about the database infrastructure… not the application/BI infrastructure required for data warehousing. The BI and ETL components are perfectly suited to cloud computing… they reflect a workload that, in general, runs on under-utilized hardware with BI running during the day and ETL running at night. I have suggested this to my current employer… but alas, I am neither King nor a member of Court.

So in Part 1 let me discuss my first two assumptions and the implications… In Part 2 I’ll discuss data warehousing and elasticity… In Part 3 I’ll consider the Paraccel/Amazon collaboration and in Part 4 I’ll wrap up and consider several new things coming that may change the equations.
—————-
I’ll not work too hard to justify my first assumption… I think that it is well-understood that a shared-nothing architecture provides the best possible approach to scale out. Google and others use this approach to scale to hundreds of petabytes of data and Teradata, Greenplum, Netezza, Paraccel, SAP HANA, and others use it in the data warehouse space. Exadata uses a hybrid approach that scales I/O in a shared-nothing-like storage subsystem… but fails to scale as it passes data to the RAC layer (see Kevin Closson here on the subject).

But the implications are significant for our cloud discussion. First, cloud infrastructure is designed to support general client-server or web-server based commercial computing requirements. A shared-nothing database cluster is a specialized infrastructure optimized for database processing. Implementing the specialized problem on the generalized infrastructure is possible, but sub-optimal. Next, cloud computing requires, more or less, a shared storage subsystem. A shared-nothing architecture shares nothing. Implementing a shared-nothing database on a shared storage subsystem is possible, but sub-optimal.

I believe that the second assumption is also pretty straightforward. The primary rationale for cloud computing comes from the recognition that many data centers deployed applications on servers that were not fully utilized. By virtualizing the hardware on a cloud platform the data center could better service the applications with fewer hardware resources and therefore less cost.

So… in order for cloud computing to be a perfect fit we need to observe a data warehouse database workload with underutilized hardware infrastructure… You might ask yourself… are there underutilized hardware resources upon which my EDW is built? In most cases I believe that the answer to this question will be “no”. Almost every EDW I’ve seen is over-burdened… stretched… with users demanding more and more resource… more data, more users, more queries, deeper queries drive the resource requirements up exponentially. The database is swamped all day with queries and swamped all night by ETL and reporting tasks.

So let’s end this blog concluding that there is a problematic architectural mismatch between a shared cloud and a shared-nothing implementation… and that if your warehouse database platform is highly utilized then there may be little benefit from implementing a warehouse in the cloud.

See Part 2 here

Exit mobile version
%%footer%%