Databases – Page 4 – Database Fog Blog

Microsoft SQL Server Announcements – November 2012

Here is one I composed for SAP on the HANA blog about the recent Microsoft SQL Server announcements that is not too obnoxiously pro-HANA. It is more about the data architecture required to handle a world where the client is a mobile device and every query must complete sub-second. This, I believe, is where we are headed… taking those BI queries that run in an hour on weak warehouses and improving the response to 10 seconds won’t cut it if your user is on a mobile device… and if the query is customer-facing you will be out of business…

The only way to solve for this is to get lots of silicon between you and your data… and hope that no queries miss the cache… or put it all in-memory.

———

I might have added to the work post that anytime a database vendor pre-announces a product that is due out in 1-2 years, “2014-2015” in this case, it is marketing not architecture… meant to freeze SQL Server customers in place while Microsoft tries to catch up.

Make sure to have a look at the comments… there is a great link to a Microsoft mouthpiece who suggests that I must have no technical background and that I am a liar. Nice.

Teradata, HANA and NUMA

Teradata is circulating a document to customers that claims that the numbers SAP has published in its 100TB PoC white paper (here) demonstrate that HANA suffers from scaling issues associated with the NUMA-effect. The document is so annoyingly inaccurate that I have to respond.

NUMA stands for non-uniform-memory-access. This describes an architecture whereby each core in a multi-core system has some very fast local memory accessed directly through a memory bus… but has access to every other core’s local memory through a “remote” access hop over another fast bus. In the case of Intel Xeon servers the other fast bus is known as the QPI bus. “Non-uniform” means that not all memory accesses are equal… a remote access over the QPI bus is slower than access over the memory bus.

The first mistake in the Teradata document is where they refer to the problem as the “SMP Knee Curve”. SMP stands for symmetric multi-processing… an architecture where multiple cores share the same memory bus. The SMP Knee Curve describes the problem when too many cores are contending for the same bus. HANA is not certified to run on an SMP system. The 100TB PoC described above is not run on an SMP system. When describing issues you might expect Teradata to at least associate the issue with the correct hardware architecture.

The NUMA-effect describes problems scaling processors within a single NUMA node. Those issues can impact the ability to continuously add cores as memory locking issues across the QPI bus slow the system. There are ways to mitigate this problem, though (see here for some examples of how to code around the problem).

Of course HANA, which was built as an in-memory system with NUMA as a target from the start… has these NUMA mitigations built in. In fact, HANA is designed deeper still, using special techniques to keep the processor caches filled and to invoke special-purpose SIMD instructions. HANA is built so close to the hardware that processor cycles that are unused due to cache misses but show up as processor busy are avoided (in other words, HANA will get more work done on a 100% CPU busy system than other software that also shows 100% CPU busy). But Teradata chose to ignore this deep integration… or they were unaware of these techniques.

Worse still, the problem Teradata calls out… shouts out… is about scaling over 100 nodes in a shared-nothing configuration. The NUMA-effect has nothing at all to do with scale out across nodes. It is an issue within a single node. For Teradata to claim this is silliness at best. It is especially silly since the shared-nothing architecture upon which HANA is built is the same architecture Teradata uses.

The twists Teradata applies to the numbers are equally absurd… but I’ll stop here and hope that the lack of understanding they exhibit in throwing around terms like “SMP Knee Curve” and “NUMA-effect” will cast enough doubt that the rest of their marketing FUD will be suspect. Their document is surely not about architecture… it is weak marketing… you can see more here…

Netezza Workload Management

@henryccook made an interesting point regarding Netezza workload management this morning… He suggested that once a SPU is engaged by a snippet the work must be completed before another snippet can start. To say this another way… a SPU has no OS and cannot save context for a snippet and start another… then return.

If this is true it means that if a long-running snippet starts… a full file scan of a fact table with no use of the zone map… then that snippet will lock out other queries until it completes.

This is not a very fine-grained approach to workload management and we would expect it to cause difficulties.
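
To make the concern concrete, here is a minimal sketch that assumes the claim above is true: a processing unit runs each snippet to completion, in arrival order, with no way to pause one and start another. The function name and the workload numbers are hypothetical and have nothing to do with Netezza's actual scheduler.

```python
def completion_times(snippets):
    """Run snippets one at a time, in arrival order, with no preemption.

    `snippets` is a list of (name, seconds_of_work) pairs, all assumed to
    arrive at time 0. Returns {name: time at which the snippet finishes}.
    """
    clock = 0.0
    finished = {}
    for name, work in snippets:
        clock += work          # the unit is busy until this snippet is done
        finished[name] = clock
    return finished


# Hypothetical workload: one full scan queued ahead of two short lookups.
queue = [("full_scan", 600.0), ("lookup_1", 0.5), ("lookup_2", 0.5)]
times = completion_times(queue)
print(times)  # {'full_scan': 600.0, 'lookup_1': 600.5, 'lookup_2': 601.0}
```

Under this assumption the half-second lookups wait the full ten minutes behind the scan, which is the lock-out effect described above.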

Can anyone confirm that this is true? It feels right from an architectural perspective…

Data Recovery in HANA, TimesTen, and SQLFire

There is a persistent myth, like a persistent cough, that claims that in-memory databases lose data when a hardware failure takes down a node because memory is volatile and non-persistent. This myth is marketing, not architecture.

Most RDBMS products, including Oracle, TimesTen, and HANA, have three layers where data exists: in-memory (think SGA for Oracle), in the log, and on disk. The normal process goes like this:

1. A write transaction arrives;
2. The transaction is written to the log file and committed… this is a very quick process with 1 sequential I/O… quicker still if the log file is on an SSD device;
3. The query updates the in-memory layer; and
4. After some time passes, the in-memory data is saved to disk.

Recovery for these databases is easy to understand:

If a hardware failure occurs after #1 but before #2 the transaction has not been committed and is lost.
If a hardware failure occurs after #2 but before #3 the transaction is committed and the database is rebuilt when the node restarts from the log file.
If a hardware failure occurs after #3 but before #4 the same process occurs… the database is rebuilt when the node restarts from the log file.
If a hardware failure occurs after #4 the database is rebuilt from the disk copy.
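
The recovery cases above can be sketched with a toy model. This is only an illustration under simplifying assumptions (a single node, the log and the disk image modeled as Python objects that survive a crash); the class and method names are hypothetical and do not reflect how Oracle, TimesTen, or HANA implement logging.

```python
class ToyDatabase:
    """Toy model of the log / memory / disk layers described above."""

    def __init__(self):
        self.log = []      # durable, append-only log (survives a crash)
        self.disk = {}     # durable saved image (survives a crash)
        self.memory = {}   # volatile in-memory layer (lost in a crash)

    def commit(self, key, value):
        # The change is appended to the log first and is then committed.
        self.log.append((key, value))
        # Only afterwards is the in-memory layer updated.
        self.memory[key] = value

    def checkpoint(self):
        # Later, the in-memory data is saved to disk; the log can be cleared.
        self.disk = dict(self.memory)
        self.log.clear()

    def crash(self):
        # A hardware failure wipes volatile memory only.
        self.memory = {}

    def recover(self):
        # Restart: load the disk copy, then replay committed log records.
        self.memory = dict(self.disk)
        for key, value in self.log:
            self.memory[key] = value


db = ToyDatabase()
db.commit("a", 1)
db.checkpoint()
db.commit("b", 2)   # committed, but not yet saved to disk
db.crash()
db.recover()
print(db.memory)    # {'a': 1, 'b': 2}
```

The committed value for "b" survives the crash even though it was never saved to disk, because it was in the log before the in-memory layer was touched.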

SQLFire uses a different approach (from here):

“Unlike traditional distributed databases, SQLFire does not use write-ahead logging for transaction recovery in case the commit fails during replication or redundant updates to one or more members. The most likely failure scenario is one where the member is unhealthy and gets forced out of the distributed system, guaranteeing the consistency of the data. When the failed member comes back online, it automatically recovers the replicated/redundant data set and establishes coherency with the other members. If all copies of some data go down before the commit is issued, then this condition is detected using the group membership system, and the transaction is rolled back automatically on all members.”

Redundant in-memory data optimizes transaction throughput but requires twice the memory. There are options to persist data to disk… but these options provide an approach that is significantly slower than the write-ahead logging used by TimesTen and HANA (and Oracle and Postgres, and …).

The bottom line: IMDBs are designed in the same manner as other, disk-based, DBMSs. They guarantee that committed data is safe… every time.

P.S.

See here for how these DBMSs compare when a BI/analytic workload is applied.

SQLFire, Exalytics, TimesTen, and HANA… a quick comparison

As you may have noticed I’m looking at in-memory databases (IMDB) these days… Here are some quick architectural observations on VMWare’s SQLFire, Oracle’s Exalytics and TimesTen offerings, and SAP HANA.

It is worth noting up front that I am looking to see how these products might be used to build a generalized data mart or a data warehouse… In other words I am not looking to compare them for special case applications. This is important because each of these products has some extremely cool features that allow them to be applied to application-specific purposes with a narrow scope of data and queries… maybe in a later blog I can try to look at some narrow use-cases.

Further, to make this quick blog tractable I am going to assume that the mart/dw problem to be solved requires more data than can fit on one server node… and I am going to ignore features that let queries access data that resides on disk… in-memory or bust.

Finally I will assume that the SQL dialect supported is sufficient and not drill into details there. I will look at architecture not SQL features…

Simply put I am going to look at three characteristics:

Will the architecture support ad hoc queries?
Does the architecture support scale-out?
Can we say anything with regards to price/performance expectations?

Exalytics is a smart-aggregate store that sits over an Oracle database to offload aggregate query workload (see my previous post here or the Rittman Mead post here which declares: “Oracle Exalytics uses a specially enhanced version of Oracle TimesTen, Oracle’s in-memory database, to cache commonly used aggregates used in dashboards, analyses and other BI objects.”). Exalytics does not support a scale-out shared-nothing architecture but it can scale up by adding nodes with new aggregate data. Queries access data within the aggregate structure and it is not possible to join to data off the Exalytics node… so ad hoc is out. Within these limits, which preclude Exalytics from being considered as a general platform for a mart or warehouse, Exalytics provides dictionary-based compression which should provide around 5X compression to reduce the amount of memory required and reduce the amount of hardware required.

TimesTen can do more. It is a general RDBMS. But it was designed for OLTP. I assume that the reason that Oracle has not rolled it out as a general-purpose data mart or data warehouse has to do with constraints that grow from those OLTP architectural roots. For example, BI queries run longer and require more data than an OLTP query… and even with data in-memory temporary storage is required for each query… and memory utilization is a product of the amount of data required and the amount of time the data has to inhabit memory… so BI queries put far more pressure on an in-memory DBMS. There are techniques to mitigate this… but you have to build the techniques in from the ground up.

I imagine that this is why TimesTen works for Exalytics, though. An OLAP query against a pre-aggregated cube does not graze an entire mart or warehouse. It is contained and “small data” (for my wacky take re: Exalytics and Exadata see here).

TimesTen is not sharded… so scalability is an issue. Oracle gets around this nicely by allowing you to partition data across instances and have the application route queries to the appropriate server. But this approach will not support joins across partitions so it severely limits scalability in a general-purpose mart or warehouse.

SQLFire is a very interesting new product built on top of Gemfire… and therefore mature from the start. SQLFire is more scalable than TimesTen/Exalytics. It supports sharded data in a cluster of servers. But SQLFire has the limitation that it cannot join data across shards (they call them partitions… see here) so it will be hard to support ad hoc queries… They provide the ability to replicate tables to support any sort of joins. If, for example, you replicate small dimension tables to coexist with sharded fact tables all joins are supported. This solution is problematic if you have multiple fact tables which must be joined… and replication of data uses more memory… but SQLFire has the foundation in place to become BI-capable over time.
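
The replicated-dimension idea can be sketched in a few lines. This is an illustration only, with hypothetical table contents and function names; it is plain Python, not SQLFire's API. Each shard holds a slice of the fact table plus a full copy of the small dimension table, so every join can be resolved locally and only the partial results are merged.

```python
# Hypothetical data: a small dimension table replicated to every shard.
dimension = {1: "East", 2: "West"}  # store_id -> region

# Each shard holds a slice of the fact table: (sale_id, store_id, amount).
shards = [
    {"facts": [(101, 1, 20.0), (102, 2, 35.0)], "dim": dict(dimension)},
    {"facts": [(103, 1, 15.0), (104, 2, 5.0)], "dim": dict(dimension)},
]


def local_join(shard):
    """Join fact rows to the dimension copy held on the same shard."""
    totals = {}
    for _sale_id, store_id, amount in shard["facts"]:
        region = shard["dim"][store_id]
        totals[region] = totals.get(region, 0.0) + amount
    return totals


def sales_by_region(all_shards):
    """Run the join on every shard, then merge the partial results."""
    merged = {}
    for shard in all_shards:
        for region, subtotal in local_join(shard).items():
            merged[region] = merged.get(region, 0.0) + subtotal
    return merged


print(sales_by_region(shards))  # {'East': 35.0, 'West': 40.0}
```

The sketch also shows the cost: the dimension is stored once per shard, and a join between two sharded fact tables would not fit this pattern, because matching rows could live on different shards.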

Performance in an in-memory database comes first and foremost from eliminating disk I/O. All three IMDB products provide this capability. Then performance comes from the efficient use of compression. TimesTen incorporates Oracle’s dictionary-based “columnar” compression (I so hate this term… it is designed to make people think that Oracle products are sort-of columnar… but so far they are not). Then performance comes from columnar projection… the ability to avoid touching all data in a row to process a query. Neither TimesTen nor SQLFire is a columnar database. Then performance comes from parallel execution. Neither TimesTen nor SQLFire can involve all cores on a single query to my knowledge.

Price comes from compression as well. The more highly compressed the data is the less memory required to store it. Further, if data can be used without decompressing it, then less working memory is required. As noted, TimesTen has a compression capability. SQLFire does not appear to compress data. Neither can use compressed data. Note that 2X compression cuts the amount of memory/hardware required in half or more… 4X cuts it to a quarter… and so on. So this is significant.
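
A minimal sketch of the two ideas in this paragraph: dictionary encoding, where a predicate is evaluated against the compressed codes without decoding the rows, and the memory arithmetic for a given compression ratio. The function names and sample column are hypothetical, and this is a generic textbook technique rather than the implementation used by any of the products discussed.

```python
def dictionary_encode(values):
    """Replace each value with a small integer code; return (dictionary, codes)."""
    dictionary = {}
    codes = []
    for value in values:
        # A value seen for the first time gets the next unused code.
        code = dictionary.setdefault(value, len(dictionary))
        codes.append(code)
    return dictionary, codes


def count_matches(dictionary, codes, wanted):
    """Count rows equal to `wanted` by comparing codes, without decoding rows."""
    code = dictionary.get(wanted)
    if code is None:
        return 0
    return sum(1 for c in codes if c == code)


def memory_needed_tb(raw_tb, compression_ratio):
    """Memory required to hold `raw_tb` of data at a given compression ratio."""
    return raw_tb / compression_ratio


column = ["DE", "US", "US", "FR", "US", "DE"]
dictionary, codes = dictionary_encode(column)
print(codes)                                   # [0, 1, 1, 2, 1, 0]
print(count_matches(dictionary, codes, "US"))  # 3
print(memory_needed_tb(100, 2))                # 50.0
print(memory_needed_tb(100, 4))                # 25.0
```

The filter looks up the code for "US" once and then compares integers, which is the sense in which compressed data can be used without decompressing it.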

Now for some transparency… I started the research for this blog, and composed a 1st draft, last Spring while I was at EMC Greenplum. I am now at SAP working with HANA. So… I will not go into HANA at great length… but I will point out that: HANA fully supports a shared-nothing architecture… so it is fully scalable; HANA is fully parallel and able to use all cores for each query; HANA fully supports columnar tables so it provides deep compression and the ability to use the compressed data in execution. This is not remarkable as HANA was designed from the bottom up to support both BI and OLTP workloads while TimesTen and SQLFire started from a purely OLTP architectural foundation.

References:

vFabric SQLFire User’s Guide

Oracle Times Ten In-Memory Database Architectural Overview

30+ Year Old Database Architecture: DB2, Oracle, Postgres, Teradata, Sybase, and More…

As you look at the enterprise RDBMS marketplace today you will find something shocking… almost every product in the market is built based on designs and concepts that are over thirty years old. IBM’s System R grew into DB2 and influenced Oracle before 1980. Ingres, developed before 1980, became Postgres which became Netezza and Greenplum and more. Teradata was a fresh start… around 1980.

This is not a bad thing in its own right… but imagine the hardware architectures these systems were designed and optimized for. Maybe DB2 was built for a multi-core mainframe… maybe Oracle too… maybe. Memory was tiny… so memory management was important and memory was used sparingly. Data sizes were tiny. Consider the fact that Teradata named the company based on the belief that someday way beyond the planning horizon some customers might get to a terabyte of data.

The reality is that these old designs are inefficient. The vendors have hacked the old code to continuously extend their products. I mean this as a compliment. It is not trivial engineering to find tweaks and tack-ons that make old code work on new hardware architectures. Teradata and Netezza and Greenplum designed ways to use multiple address spaces to take advantage of multiple cores. Oracle tacked-on a shared-nothing I/O subsystem to a shared-everything architecture to stretch.

But these hacks are not efficient.

Yale is working on some new-new stuff (see here). HANA is based on a completely different design (see here). The NoSQL vendors have bent the ACID-tested rules, if not always the fundamental approaches.

I can’t help but believe that in one of these new approaches is a path forward.

If you would like to read some history of the start here is a cool link.

The Rational Economics of In-memory Databases (Is memory getting cheaper faster than Data Warehouses are getting bigger?)

I have just written a commercial blog for work refuting some silliness from Teradata here and here. Since some of this refutes an argument that targets in-memory database architecture in general it is worth restating the case here.

The Teradata argument states that since data warehouses are growing 40% per year and the cost of memory is dropping only 20% per year that the economics of in-memory databases (IMDB) is “irrational” and that the whole IMDB idea is “hype”. Let’s have a look at the Teradata argument…

First, let’s imagine a 100TB data warehouse that is built today… and let’s imagine that it is economically reasonable today. There is an explicit argument for this here and an implicit argument here… but since the Teradata argument says that the IMDB economics get worse over time it really doesn’t matter where we start. If Teradata is right then time will tell.

Now let’s apply Teradata’s economics for a couple of years…

Next year, according to Teradata, the data warehouse will have grown to 140TB and the cost of memory will have dropped 20%… making IMDB more economic. The following year your data warehouse will have grown to about 200TB and the cost of memory will have dropped another 20% making the IMDB even more cost-effective. The following year the DW will be about 275TB and the cost of memory will have dropped another 20% making it even more cost-effective.
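
For anyone who wants to reproduce the arithmetic, here is a small sketch that compounds the two rates from the Teradata argument (40% annual data growth, 20% annual drop in memory price) from the 100TB starting point. The function name and defaults are mine; the text above quotes the resulting sizes in round numbers.

```python
def project(years, start_tb=100.0, growth=0.40, price_drop=0.20):
    """Yield (year, warehouse size in TB, memory price per TB relative to year 0)."""
    for year in range(years + 1):
        size_tb = start_tb * (1 + growth) ** year
        price_index = (1 - price_drop) ** year
        yield year, size_tb, price_index


for year, size_tb, price_index in project(3):
    print(f"year {year}: {size_tb:6.1f} TB, memory price per TB x{price_index:.3f}")
```

The loop prints one line per year with the projected warehouse size and the price of a terabyte of memory relative to today.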

In other words, the Teradata sound bite is silly. It has emotional appeal… but it is nonsense.

But there is more. Moore’s Law does not say that price will fall 2X every 2 years… it suggests that performance (actually transistor density) will improve 2X every two years. The fact is that memory prices are falling AND memory speeds are improving… and the gap between memory speeds and disk speeds is increasing. So the gap in price/performance of an IMDB vs. a disk-based system is increasing exponentially.

These are the economics that matter… and these are the economics that are driving Teradata to put silicon in-between their disks and their processors.

Teradata’s argument is marketing, not architecture.

Decision Support Redux

In the late 1980’s and the early 1990’s the software that business users executed to run reports, fire off canned queries, and/or explore data ad hoc was called “decision support” software. Later, and still today, the term “business intelligence” came into use.

I never understood the sense of the switch. The term “business intelligence” is vague… sort of fluffy and pretentious. “Decision support” implies a purpose. In the years when the switch from one term to the other was in progress, if you asked the question: what do you mean by “business intelligence” the answer was… it is “decision support”.

Today the analytics that underlie both terms are becoming more sophisticated, and they execute in near-real-time. It could be said that there is business intelligence in the process that acquires data, analyzes it, discovers a pattern, and applies a rule automatically as a result. But the software programmer who built the system was focused on automating the decision process… not on creating intelligence.

A clear focus on supporting complex decisions will increase the chances of delivering a return on your investment in analytics. “Intelligence” is not useful unless it is applied to make a better decision. I vote for a return to the phrase “decision support”.

Numbers Everyone Should Know

Some of you have seen me build simple models to do a reality-check on architecture (see here, for example). Here are some metrics from a great presentation by Jeff Dean, a Google fellow.

Numbers Everyone Should Know

L1 cache reference	0.5 ns
Branch mispredict	5 ns
L2 cache reference	7 ns
Mutex lock/unlock	25 ns
Main memory reference	100 ns
Compress 1K bytes with Zippy	3,000 ns
Send 2K bytes over 1 Gbps network	20,000 ns
Read 1 MB sequentially from memory	250,000 ns
Round trip within same datacenter	500,000 ns
Disk seek	10,000,000 ns
Read 1 MB sequentially from disk	20,000,000 ns
Send packet CA->Netherlands->CA	150,000,000 ns

Of note is the 80X difference between the cost of reading 1MB from memory (250,000 ns) and the cost of reading 1MB from disk (20,000,000 ns)…
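
A quick way to reality-check ratios like this is to compute them straight from the table. The sketch below uses a few of the figures exactly as they appear in the table above; the dictionary and helper names are just for illustration.

```python
# Latencies in nanoseconds, copied from the table above.
LATENCY_NS = {
    "main memory reference": 100,
    "read 1 MB sequentially from memory": 250_000,
    "disk seek": 10_000_000,
    "read 1 MB sequentially from disk": 20_000_000,
}


def ratio(slow, fast):
    """How many times slower `slow` is than `fast`."""
    return LATENCY_NS[slow] / LATENCY_NS[fast]


sequential = ratio("read 1 MB sequentially from disk",
                   "read 1 MB sequentially from memory")
random_access = ratio("disk seek", "main memory reference")
print(sequential)     # 80.0
print(random_access)  # 100000.0
```

With these figures a sequential 1MB read is 80 times slower from disk than from memory, and a single disk seek costs as much as 100,000 main memory references.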

The entire presentation may be found here: http://www.odbms.org/download/dean-keynote-ladis2009.pdf

Happy Modeling…

Real-time Analytics and BI: Part 1 – Singing for my Dinner

Several months ago I was invited to a dinner attached to a data science summit… with the price being that I had to deliver a 5 minute talk… I had to sing for my dinner. The result was this thinking on real-time analytics and the Toyota Prius.

Real-time analytics implies two things:

Changes in the data are evaluated continuously; and
The results of the analysis are used or displayed continuously.

In a Toyota Prius we can see two examples of real-time analytics.

The first is in the anti-lock braking system. There, data reflecting the pressure on the brake pedal and the rotation of each wheel is sent to a computer that analyzes the results and adjusts the brake pressure on each wheel so that all four wheels turn at the same rate and the car stops in a straight line.

Note that the analytics are real-time and the results are used immediately without human intervention. This is important. It makes little sense to spend the money to capture and analyze data in real-time if the results are not actionable in near-real-time.

Think for a moment about the BI systems built over the last 20 years. First we captured and analyzed monthly data… and acted on that data within a 30-day window. Then we increased the granularity of the data to weekly and slightly adjusted the reports to reflect the finer granularity… and acted on the data within 7 days. Then we adjusted the data to daily and acted on the results each day. Then we adjusted the data to hourly and reacted even more quickly. These changes often did not fundamentally change the business processes driven by the data… they just made the processes more sensitive to the fine-grained information.

But if the data-driven business process takes ten minutes to complete… for example it takes ten minutes for staff to pick inventory, package the results, and load a delivery truck; could there be a return on the investment expense of developing a continuous, real-time analytic? I think not. There may, however, be ROI associated with a new robotic pick, package, and load process…

There is another possibility… If sometimes the pick, package, and load takes ten minutes and sometimes it takes fifteen minutes then the best solution is to perform the analytics on the current state on-demand… when there are resources to support the process. This maximizes the use of the resources without changing the business process.

The point here is that real-time requires a re-think… or at least a deep-think. The business process may have to change significantly to support real-time analytics.

The second real-time system in the Prius illustrates the problem. On the dashboard the Prius displays, in real-time, the state of the hybrid gas-electric system. It shows whether the battery is charging or discharging… it shows whether the car is being driven using the electric or the internal-combustion engine. It is one of the most beautiful dashboard displays you have ever seen… and executives everywhere must look at it and wonder why they cannot get such a beautiful display of the state of their business… after-all… BI dashboards are “the thing”.

But the Prius display is useless. There is no action you would take while driving based on this real-time display. From a decision-making view it represents useless and expensive flash (that helps to sell the Prius…).

So… approach real-time analytics with a deep-think. Look for opportunities like the anti-lock braking system where real-time analytics can be embedded into automatic business processes. Avoid flashy dashboards that do not present actionable data.

In-memory databases (IMDB) such as SAP HANA, Oracle TimesTen, and VMWare SQLFire promise to enable real-time analytics… and this promise is real… the opportunities can and will revolutionize the enterprise over time… but a revolution is not the same old BI at a finer granularity… it is much more significant than that. Heads will roll.
