Refactoring Databases P2.5: Scoping the Cheats

One of the side-effects of the little cheat posted here is that, if we are going to design early, we have to decide what we will design early… and this question has two complications. First, we have to ask ourselves how detailed a design we can produce before the result becomes un-agile. Next, we have to ask ourselves whether we should design up the stack a little. My opinions will be doggedly blogged in this post.

I will offer two ends of a spectrum to suggest a way to manage the scope of your advance design cheat. Let me remind you that the cheat suggests that you look at the user stories that will be sprinted on next and devise the schema required by those stories… and maybe refactor the existing schema a little (more on this in the next post)… no more than that.

On one side we may develop a complete design with every detail specified: subject areas, tables, columns, data types, and domains. The advantage here is that the code developers have a spec to code to, and this could increase velocity. But the downside is that developers will be working with users to adjust the code in real time. If the schema does not fit the adjustments then you may be refactoring the new stuff, and velocity may decrease.

The other side of the spectrum would have database designers build just a skeleton: a conceptual schema with subject areas, tables, and primary keys for each table. This provides a framework that corrals the developers without fencing them in so tightly that they cannot express agility.
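To make the skeleton concrete, here is a minimal sketch of what such a conceptual schema might look like: subject areas, tables, and primary keys, nothing more. The entity names are hypothetical, and I use SQLite only because it runs anywhere… the point is what is left out.

```python
import sqlite3

# Hypothetical conceptual skeleton: subject areas as comments, tables
# with primary keys only. Columns, types, and domains are deliberately
# left for the sprint teams to define as the stories demand.
SKELETON_DDL = """
-- Subject area: Party
CREATE TABLE person  (person_id  INTEGER PRIMARY KEY);
-- Subject area: Benefits
CREATE TABLE claim   (claim_id   INTEGER PRIMARY KEY);
CREATE TABLE payment (payment_id INTEGER PRIMARY KEY);
-- Subject area: Reference
CREATE TABLE office  (office_id  INTEGER PRIMARY KEY);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SKELETON_DDL)  # the corral, not the pen
print([row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")])
```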

Remember that the object here is to reduce refactoring without reducing agility…

IMO the conceptual model approach is best. Let’s raise free range software engineers who can eat bugs in the wild rather than be penned in. The conceptual model delimits the range… a detailed schema defines a pen.

There is one more closely related topic for the next post… how do we manage transient objects in a RESTful application?

Refactoring Databases P2: Cheating a Little

In this post I am going to suggest doing a little design a little upfront and violating the purity of agility in the process. I think that you will see the sense.

To be fair, I do not think that what I am going to say in this post is particularly original… But in my admittedly weak survey of agile methods I did not find these ideas clearly stated… So apologies up front to those who figured this out before me… And to those readers who know this already. In other words, I am assuming that enough of you database geeks are like me, just becoming agile literate, and may find this useful.

In my prior post here I suggested that refactoring, re-working previous stuff due to the lack of design, was the price we pay to avoid over-engineering in an agile project. Now I am going to suggest that some design could eliminate some refactoring to the overall benefit of the project. In particular, I am going to suggest a little database design in advance as a little cheat.

Generally an agile project progresses by picking a set of user stories from a prioritized backlog and tackling development of code for those stories in a series of short, two-week sprints.

Since design happens in real time during the sprints it can be uncoordinated… And code or schemas designed this way are refactored in subsequent sprints. Depending on how uncoordinated the schemas become, the refactoring can require a significant effort. If the coders are not data folks… and, worse, are abstracted away from the schema via an ORM layer… the schema can become very silly.

Here is what we are trying to do at the Social Security Administration.

In the best case we would build a conceptual data model before the first sprint. This conceptual model would define only the 5-10 major entities in the system… Something highly likely to stand up over time. Sprint teams would then have the ability to define new objects, in an agile manner, within this conceptual framework… And would require permission only when new concepts are required.

Then, and this is a better case, we have data modelers working 1-2 sprints ahead of the coders so that there is a fairly detailed model to pin data to. This better case requires the prioritized backlog to be set 1-2 sprints in advance… A reasonable, but not certain, assumption.

Finally, we are hoping to provide developers with more than just a data model… What we really want is to provide an object model with basic CRUD methods in advance. This provides developers with a very strong starting point for their sprints and lets the data/object architecture evolve in a more methodical manner.
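As a sketch of what handing developers an object model with basic CRUD methods might look like (the entity, table, and names are hypothetical; a real version would be generated from the sprint-ahead data model):

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional

@dataclass
class Claim:
    claim_id: int
    status: str

class ClaimStore:
    """Hypothetical object handed to coders with basic CRUD in place."""

    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn
        conn.execute("CREATE TABLE IF NOT EXISTS claim "
                     "(claim_id INTEGER PRIMARY KEY, status TEXT)")

    def create(self, claim: Claim) -> None:
        self.conn.execute("INSERT INTO claim VALUES (?, ?)",
                          (claim.claim_id, claim.status))

    def read(self, claim_id: int) -> Optional[Claim]:
        row = self.conn.execute(
            "SELECT claim_id, status FROM claim WHERE claim_id = ?",
            (claim_id,)).fetchone()
        return Claim(*row) if row else None

    def update(self, claim: Claim) -> None:
        self.conn.execute("UPDATE claim SET status = ? WHERE claim_id = ?",
                          (claim.status, claim.claim_id))

    def delete(self, claim_id: int) -> None:
        self.conn.execute("DELETE FROM claim WHERE claim_id = ?",
                          (claim_id,))

store = ClaimStore(sqlite3.connect(":memory:"))
store.create(Claim(1, "OPEN"))
print(store.read(1))  # Claim(claim_id=1, status='OPEN')
```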

Let me be clear that this is a slippery slope. I am suggesting that data folks can work 2-4 weeks ahead of the coders and still be very agile. Old school enterprise data modelers will argue… why not be way ahead and prescribe adherence to an enterprise model? You all will have to manage this slope as you see fit.

We are seeing improvement in the velocity and quality of the code coming from our agile projects… And in agile… code is king. In the new world an enterprise data model evolves based on multiple application models, and enterprise data modelers need to find a way to influence, not dictate, data architecture in an agile manner.

Refactoring Databases P1: Defining Some Terms

We are in the middle of several agile projects at the SSA… so I’ll start the year by sharing some related issues and solutions we are considering…

I am going to try to suggest some ideas about refactoring databases that are a little different from some of the concepts in the book and blogs on the subject here and here. In this first post let me try to define refactoring in a general enough way that the same definition and use of the term works for both code and database design.

To start, refactoring could be a general term for incrementally tweaking any software. We might suggest that we have always refactored software, more or less. But IMO the term has taken on meaning as part of agile software development methods, and so I will assume this is the proper use of the term.

Agile is a melting pot of several methodologies that emerged as a reaction to the inefficiencies of stepwise waterfall methods. As a result agile has many features that make it useful… I’m going to focus on just the one that I consider most relevant to refactoring: an agile method develops software incrementally with only a short-term end state as a target. Each increment adds new functionality and the system evolves. As a result, it is not possible, or at least not correctly agile, to establish a detailed design in advance… the system design evolves with the system.

This is a very hard concept to grasp. You cannot design up front for a system that has an undetermined end state. The waterfall concept of designing first must be modified to be agile. I’ll suggest how to do this in a later post… rest assured that some design is required… and think about how we might build software with no design other than that inherited from the last set of sprints and with only the current sprint user stories to guide us.

If you have grokked this then you are on your way to understanding agile and refactoring.

Imagine that you have built a function that is seldom called… But the current sprint user story will call the function thousands of times a second. Imagine further that you built the original function simply, in a stateful manner… But now, in order to meet the new scalability requirements, you realize that the function will need to be stateless. What you are imagining is the need to refactor the function.
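A toy sketch of the imagined refactoring (the function and data are hypothetical; note that the signature and the results do not change, only the internals):

```python
RATES = {"east": 0.07, "west": 0.06}  # hypothetical reference data

# Before: stateful -- the function memoizes into shared module state.
# Harmless when seldom called; a contention point (and a correctness
# hazard across threads) at thousands of calls per second.
_cache = {}

def lookup_rate(region):
    if region not in _cache:          # shared mutable state
        _cache[region] = RATES[region]
    return _cache[region]

# After: stateless -- same name, same signature, same results, and no
# shared mutable state, so concurrent callers need no coordination.
def lookup_rate(region):              # refactored; callers unchanged
    return RATES[region]

assert lookup_rate("east") == 0.07
```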

Now you might ask: should you have known, in advance, that performance was going to be an issue? Maybe… But maybe not. The point is that when you design just-in-time in an agile manner you cannot, and should not, get too far ahead of yourself and over-engineer. Over-engineering is one of the side-effects of waterfall methods that agile aims to avoid… And refactoring is the result… It is a trade-off, not a perfect solution (again, bear with me and I’ll suggest another trade-off later that you might like).

So refactoring is the process that adjusts design incrementally in an agile project:

Refactoring is a disciplined technique for restructuring an existing body of code, altering its internal structure without changing its external behavior.

Its heart is a series of small behavior preserving transformations. Each transformation (called a “refactoring”) does little, but a sequence of transformations can produce a significant restructuring. Since each refactoring is small, it’s less likely to go wrong. The system is kept fully working after each small refactoring, reducing the chances that a system can get seriously broken during the restructuring.

– Martin Fowler

Note that in the example I suggested where we refactored for performance we followed this definition closely… the refactored function was changed but its behavior was unchanged… no program that called the function was changed as a result.

Let me make two points here and close…

First: refactoring is about making changes to code and to databases that preserve the behavior of the code or the databases. Refactoring is not about new functionality, with new behaviors, that might be added incrementally as the project agilely progresses.

Next, refactoring is not about any incremental change… It is about incremental change in an agile project where the end state is uncertain enough to preclude a complete design. If we change a column in a database with some certainty that the column will satisfy a long-term vision then that change is not refactoring. Refactoring is not a process to guide a database migration or database modernization process.

When the end state is well understood it is silly to code stuff that you know will break later… And incrementally changing separate parts of a database that you are pretty certain will not change in the future is not refactoring.

This may seem obvious… but as you will see in the next post… the definitions matter.

Hadoop and ETL

My last post (here) blathered about the effect that Hadoop must have on database vendor profits. An associate wrote me with the reminder that Hadoop is also impacting revenues and profits of ETL companies.

If you think about Hadoop as both an inexpensive staging area for an EDW and as a parallel compute engine that can transform ungoverned, extracted data and load it into a governed EDW platform… then you are just one thought away from realizing that these two functions have heretofore been in the domain of ETL… and that moving these functions to Hadoop might have an effect on the ETL space.
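As a hedged sketch of the idea: the transform that an ETL engine would have executed on its own server can be pushed down to the Hadoop cluster, here with Spark standing in for the eco-system’s processing engine (the paths and columns are made up):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("elt-staging").getOrCreate()

# Extract: raw, ungoverned files land cheaply in the Hadoop staging area.
raw = spark.read.csv("hdfs:///staging/orders/", header=True)

# Transform: the work an ETL engine used to perform on its own server,
# now run in parallel across the Hadoop cluster.
clean = (raw
         .where(F.col("order_id").isNotNull())
         .withColumn("amount", F.col("amount").cast("double"))
         .dropDuplicates(["order_id"]))

# Load: write a governed, columnar copy for the EDW to ingest.
clean.write.mode("overwrite").parquet("hdfs:///governed/orders/")
```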

I do not believe that ETL tools will go away… but they may become just the GUI development environment that lets you quickly develop transformations and connect them into an end-to-end ETL process. The scheduling, processing engine, and monitoring could then be handled by the Hadoop eco-system.

Here is the idea from a previous post.

About five years ago the precursor to Alpine Data Labs, then an EMC Greenplum subsidiary, was developing a GUI for analytics that connected processes and I suggested they spin the product both into analytics and into ETL… I’ll have to look and see where they are these days…

Hadoop and Company Financial Performance

I have posted several times about the impact of the Hadoop eco-system on several companies (here, here, here, for example). The topic came up in a tweet thread a few weeks back… which prompts this quick note.

Four years ago the street price for a scalable, parallel, enterprise data warehouse platform was $US25K-$US35K per terabyte. This price point provided vendors like Teradata, Netezza, and Greenplum reasonable, lucrative margins. Hadoop entered the scene and captured the Big Data space from these vendors by offering 20X slower performance at 1/20th the price: $US1K-$US5K per terabyte. The capture was immediate and real… customers who were selecting these products for specialized, very large, 1PB-and-up deployments switched to Hadoop as fast as possible.

Now, two trends continue to eat at the market share of parallel database products.

First, relational implementations on HDFS continue to improve in performance and they are now 4X-10X slower than the best parallel databases at 1/10th-1/20th the street price. This puts pressure on prices and on margins for these relational vendors and this pressure is felt.

In order to keep their installed base of customers in the fold these vendors have built ever more sophisticated integration between their relational products and Hadoop. This integration, however, allows customers to significantly reduce expense by moving large parts of their EDW to an Annex (see here)… and this trend has started. We might argue whether an EDW Annex should store the coldest 80% or the coldest 20% of the data in your EDW… but there is little doubt that some older data could satisfy its SLAs even at 4X-10X slower performance.

In addition, these trends converge. If you can only put 20% of your old, cold data in an Annex that is 10X slower than your EDW platform then you might put 50% of your data into an Annex that is only 4X slower. As the Hadoop relational implementations continue to add columnar, in-memory, and other accelerators… ever more data could move to a Hadoop-based EDW Annex.
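Some back-of-the-envelope arithmetic makes the pressure clear (a sketch; the price points are the rough street prices discussed above, and the 500TB warehouse is hypothetical):

```python
# Rough street prices per TB, from the discussion above.
EDW_PER_TB, ANNEX_PER_TB = 25_000, 3_000

def blended_cost(total_tb, annex_fraction):
    """Cost of a warehouse with a fraction of its data in a Hadoop annex."""
    annex_tb = total_tb * annex_fraction
    return (total_tb - annex_tb) * EDW_PER_TB + annex_tb * ANNEX_PER_TB

total = 500  # hypothetical 500TB EDW
for frac in (0.0, 0.2, 0.5, 0.8):
    print(f"{frac:>4.0%} in annex: ${blended_cost(total, frac) / 1e6:,.1f}M")
# 0%: $12.5M, 20%: $10.3M, 50%: $7.0M, 80%: $3.7M... the pressure is real.
```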

I’ll leave it to the gamblers who read this to guess the timing and magnitude of the impact of Hadoop on the relational database markets and on company financial performance. I cannot see how it cannot have an impact.

Well, actually I can see one way out. If the requirement for hot data that requires high performance accelerates faster than the high-performance advances of Hadoop then the parallel RDBMS folks will hold their own or advance. Maybe the Internet of Things helps here… but I doubt it.

Database Super-computing

Today I am going to focus on a topic that I’ve suggested previously without the right emphasis: the new database architecture that uses vector processing on compressed columns to significantly accelerate performance.

The term “super-computing” was coined to describe the extreme hardware and software optimization developed to crunch numbers in scientific applications. As these technologies developed, super-computer hardware evolved to leverage parallel microcomputers and software evolved to better leverage parallelism. Recently, microcomputers have started to incorporate the specialized instructions that support advanced mathematical applications. These super-computer instructions directly support vector algebra by manipulating strings of bits, vectors, in a single instruction. Finally, application developers recognized that these bit strings, these vectors, could be loaded into the microprocessors in a more effective manner, letting them optimize their applications to the bare metal.

The effects of these optimizations accumulate for these applications: vectors compress and use memory more effectively, vectors load into processor cache more effectively, and vector instructions dramatically outperform integer instructions. The cumulative effect is that super-computer programs may be 10X-100X faster than commercial applications that provide the same result.

As this evolution progressed there was a similar evolution changing the architecture of database technology. Databases actually leveraged microcomputers before the high performance space made the move. But databases focused on the benefits of massively parallel I/O more than on the benefits of parallel compute. The drive to minimize the cost of I/O eventually led database developers to implement column store and then a very interesting discovery was made. Engineers recognized that a highly compressed column, a string of bits, could be processed as a vector.

Let’s see if we can make this 10X-100X number more than marketing foam. We can do this by roughly comparing the low-level processing of a chunk of data in integer and then in vector formats.

Let’s skip I/O processing and just focus on internals. This simplification greatly favors our integer DBMS. Keep in mind that the vector DBMS will process compressed vector data directly while the integer DBMS will expend resources to uncompress data and then take up 4X or more the memory. This less efficient memory utilization increases the chance that an I/O may be required, and I/O is very expensive in the scenario we will discuss. Even an I/O required just 1% of the time by the integer DBMS will provide a 1000X-100,000X advantage to the vector DBMS (see Figure 8 to gauge the latency to SSD or to disk).

Figure 8. Some Latency Metrics

So we’ll start with uncompressed integer data versus compressed vector data. We can assume that both databases are effective at populating cache. But the 4X compression advantage means that the vector processor is more likely to find data in the fast Level 1 cache and in the mid-range L2 cache. Given the characteristics outlined in Figure 8 we might suggest that the vector database is 4X more likely to find data in cache than the integer database, and if we assume the latency of L2 cache as an estimate this results in a 15X-200X performance advantage.

Since the data is in vector form we can perform relational algebra and basic mathematics using vector algebra and vector addition. This provides another 8X-50X boost to the vector side.
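You can feel the vector effect even from Python (a toy sketch with NumPy standing in for vector-processing instructions; a real engine does this on compressed columns in cache):

```python
import numpy as np

sales = np.random.randint(1, 1000, size=1_000_000, dtype=np.int64)

# Integer, row-at-a-time style: one value per loop iteration.
def scalar_sum_over(threshold):
    total = 0
    for v in sales:
        if v > threshold:
            total += v
    return total

# Vector style: the filter and the sum are whole-column operations that
# compile down to SIMD instructions over contiguous memory.
def vector_sum_over(threshold):
    return sales[sales > threshold].sum()

assert scalar_sum_over(500) == vector_sum_over(500)
# On typical hardware the vector form wins by one to two orders of magnitude.
```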

When we combine these advantages we see that a 10X-100X advantage is conservative. The bottom line is clear. A columnar database that effectively manages vectors into cache and further utilizes super-computing instructions will significantly out-perform an integer-based product.

The era of database super-computing has begun.

A New Way of Thinking About EDW Federation

There is a new way to think about data warehouse architecture. The Gartner Group calls it a logical data warehouse and it uses database federation to dynamically integrate a universe of data warehouses, operational data stores, and data marts into a single, unified structure. This blog has suggested that there is a special case of the logical data warehouse that uses Hadoop to provide a modern data warehouse architecture with significant economic advantages (see here, here, and here).


This is the second post inspired by my chats with Bityota… and sort of, but not altogether, commercial in nature (the first post is here). That is, Bityota will use these posts in their collateral… but you won’t see foam about their products in the narrative below.

– Rob


The economics are driven by the ability, through database federation, to place tables on a less expensive database platform. In short, the aim is to place data on the least expensive platform that still provides enough performance to satisfy service level agreements (SLAs). In the case of the modern data warehouse architecture this means placing older, colder data on Hadoop, where the costs may be $1,000/TB, instead of keeping all of the data on a single data warehouse platform at a parallel RDBMS price point of $35,000/TB. Federation allows these two layers to appear as one to any program or end-user.
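In spirit, the federation layer does something like the following toy sketch (the catalog, connection strings, and price points are invented for illustration):

```python
# Hypothetical catalog: each table lives on the least expensive platform
# that still meets its SLA. The federation layer hides the split.
CATALOG = {
    "orders_2015": {"platform": "edw",    "cost_per_tb": 35_000},
    "orders_2009": {"platform": "hadoop", "cost_per_tb": 1_000},
}

CONNECTIONS = {"edw": "edw://prod", "hadoop": "hive://annex"}  # stand-ins

def run_query(table, sql):
    """Route a query to whichever engine owns the table."""
    platform = CATALOG[table]["platform"]
    print(f"routing to {CONNECTIONS[platform]}: {sql}")
    # ...execute via that platform's driver and return the rows...

# Programs and end-users submit the same SQL either way:
run_query("orders_2015", "SELECT SUM(amount) FROM orders_2015")
run_query("orders_2009", "SELECT SUM(amount) FROM orders_2009")
```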

This approach may be thought of as a data life-cycle infrastructure that has significantly more economic power than the hardware-based life cycles suggested by database vendors to date. Let’s consider some of the trade-offs that define the hardware approach and the Hadoop-based approach.

The power of Hadoop federation comes from its ability to manage data placement at a macro level. Here, data is placed appropriately into a separate database management system running on differentiated hardware so that the economics of the entire infrastructure (hardware, software, and network) can be optimized. It is even possible to add a third or fourth level to provide more fine-grained economic optimization. But this approach comes at a cost. Each separate database optimizes queries at the database level. Despite advances in federation software technology this split optimization cannot optimize many queries. The optimization is not poor, but it is not optimal. The global optimization provided by a single DBMS will almost always out-perform federated optimization.

The temperature-based optimization touted by some warehouse vendors provides good global optimization. But to date a single DBMS must run on a homogeneous hardware platform with a single price point. Queries run optimally, but the optimization can only twiddle with hardware details: placing data in memory, on an SSD device, or on the faster portion of a disk platter.

Figure 7. Federated Elastic Shared-nothing IaaS

To eliminate this unfortunate trade-off (good optimization over minor hardware capabilities, or fair optimization over the complete hardware eco-system) we need a single DBMS that can run queries over a heterogeneous mix of hardware. We need a single database management system, with global query optimization, that can execute queries over multiple layers of hardware deployed in the cloud. We can easily imagine a multi-layered data warehouse with queries federated over several AWS offerings: hotter data on fast nodes that are always available, warm data on less expensive nodes that are always available, and cold historical data on inexpensive nodes that come online only in processing windows so that you pay for the nodes only when you need them. Figure 7 shows a modern data warehouse deployed across an Amazon cloud.

This different way of thinking about a logical data warehouse is exciting… and a great example of how cloud computing may change everything in the database and data warehouse space.

An Elastic Shared-Nothing Architecture

In this post we will consider again the implications of implementing a shared-nothing architecture in the cloud. That is, we will start wondering how to extend a static shared-nothing cluster into an elastic hardware environment.


This is the first of three posts inspired by a series of conversations with the folks at Bityota (Bityota.com). After seeing the topics they asked if they could use the content in their marketing… so to be transparent… this is sort of a commercial post… but as you will see there is no promotional foam in the narrative.

– Rob


There is an architectural mismatch between Cloud Computing and a shared-nothing architecture.

In the Cloud, compute (processors and memory) scales independently of storage (disk and I/O bandwidth). This independence allows for elasticity: more compute can be dynamically added with full access to data on a shared disk subsystem. Figure 1 shows this relationship and depicts the elasticity that makes the Cloud so compelling.

Figure 1. Elastic Compute

In a shared-nothing architecture, compute and storage scale together as shown in Figure 2. This tight connection ensures that I/O bandwidth, the key to read performance, is abundant. But, in the end, scalability is more about scaling I/O than about scaling compute… a fact due to the imbalance Moore’s Law injects into computer architecture: compute performance has far outstripped I/O performance over the years.

Figure 2. Shared-nothing Bundles

To solve for this imbalance database engineers have worked very hard to avoid I/O. They invented indexing and partitioning and compression and column-store all with the desire to avoid I/O. When they could not avoid I/O they worked hard to minimize the cost by pre-fetching data into memory and, once fetched, by keeping data in memory as long as possible.

For example, one powerful and little understood technique is called the data flow architecture. Simply put, data flow moves rows through each step of a query execution plan and out, without requiring intermediate I/O. The original developers of Postgres, Sybase, SQL Server, Teradata, DB2, and Oracle did not have enough memory available to flow rows through, so their systems spill data to the storage layer between each step in the plan. Figure 3 shows how classic databases spill and Figure 4 shows how a more modern data flow architecture operates.

Figure 3. Classic Query Plan
Figure 4. Data Flow Query Plan
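A toy sketch of the difference, with Python generators standing in for a data flow engine: the classic plan materializes every intermediate result, while the flow plan streams each row through all of the steps:

```python
rows = [{"region": "east", "amount": a} for a in range(1_000_000)]

# Classic plan: each step materializes its full output (think: spilled
# to the storage layer) before the next step begins.
def classic_plan():
    step1 = [r for r in rows if r["region"] == "east"]  # full intermediate
    step2 = [r["amount"] * 1.07 for r in step1]         # another one
    return sum(step2)

# Data flow plan: each row streams through every step of the plan and
# out; no intermediate result is ever held (or spilled) in full.
def dataflow_plan():
    filtered = (r for r in rows if r["region"] == "east")  # lazy
    taxed = (r["amount"] * 1.07 for r in filtered)         # lazy
    return sum(taxed)  # pulls each row through the whole pipeline

assert classic_plan() == dataflow_plan()
```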

Why is this relevant? In a classic RDBMS the amount of I/O bandwidth available per GB of data is static. You cannot add storage without redistributing the data. So even though your workload has peaks and valleys, your database is bottlenecked by I/O that cannot flex. In a modern RDBMS most of the work is performed in memory without intermediate I/O… and, as we discussed, compute and memory can elastically flex in a Cloud.

Imagine an implementation as depicted in Figure 5. This architecture provides classic static shared-nothing I/O scalability to read data from disk. However, once the read is complete and a modern data flow takes over, compute and memory are managed by a scalable elastic layer. The result is an elastic shared-nothing architecture that is well suited for the cloud.

Figure 5. Flowing to a Separate Compute Node

In fact you can imagine how this architecture might mature over time. In early releases a deployment might look like Figure 5 where the advantage of the cloud is in devising a cost-effective flexible configuration. As the architecture matures you could imagine a cloud deployment such as in Figure 6 where the 1:1 connection between storage nodes and compute nodes is broken and compute can scale dynamically with the workload.

Figure 6. Elastic Compute on a Shared-nothing Architecture

Cloud changes everything and it will significantly change database systems architecture.

It is strange to say… but the torch that fires innovation has been passed from the major database vendors to a series of small start-ups. Innovation seems to occur exclusively in these small firms… with the only recent exception being the work done at SAP on HANA.

How DBMS Vendors Admit to an Architectural Limitation: Part 3 – EDW on IBM z/OS

This is the 3rd and final example of a vendor admitting, without admitting, to an architectural limitation. The first two parts on Exadata and Teradata are here and here.

Teradata started to get real traction in the EDW space with a shared-nothing architecture in the late 1980s. At that time the only real competition was DB2 on an IBM mainframe. From those days until just a couple of years ago IBM insisted that MVS, then z/OS, customers should stick to the mainframe for their data warehouses and marts. There was some dabbling with sharded data in DB2 for z/OS… and Teradata made some in-roads… as did Netezza… but IBM insisted that there was no reason not to stay Blue. DB2 on AIX and then on Linux appeared… and both offered a better price/performance option than DB2 on z/OS… but the faithful stayed faithful for the most part.

Then IBM bought Netezza, a pure shared-nothing microprocessor-based machine, and the recommendation changed. Today IBM recommends the Analytics Accelerator, based on Netezza, to mainframe users who want to deploy an EDW. This is an admission, with no admission at all, that there was all along an architectural advantage to shared-nothingness.

If you search this blog for “Netezza” you can get my perspective on that technology. But to be blunt, the Analytics Accelerator is not IBM’s best EDW platform… DB2 LUW is, by far… and with BLU, LUW is better still.

I have made it clear in my previous posts that I consider it lazy for an IT shop to commit to a vendor or to a product. As engineers we need to embrace change. For IBM z/OS shops this means a realistic look at non-z/OS alternatives to deploy or to re-deploy an EDW. It makes no sense to build a data warehouse or a data mart directly on z/OS. Use the Analytics Accelerator or, better still, open the competition to better products like DB2 LUW, Teradata, Vertica, etc.


How DBMS Vendors Admit to an Architectural Limitation: Part 2 – Teradata Intelligent Memory

This is the second post (see Part 1 here) on how vendors adjust their architecture without admitting that the previous architecture was flawed. This time we’ll consider Teradata and in-memory…

When SAP HANA appeared Teradata went on the warpath with a series of posts and statements that were pointed but oddly miscued (see the references below). According to the posts in-memory was unnecessary and SAP was on a misguided journey.

Then Teradata announced Intelligent Memory and in-memory was cool. This is pretty close to an admission that SAP was right and Teradata was wrong. The numbers which drove Teradata here are compelling… 100K-200K ns to access an SSD device versus 100 ns to access DRAM… a 1000X reduction… and the latency to disk is 100X worse than SSD.

Intelligent Memory was announced shortly after the release of Teradata’s columnar table type. Column-orientation is important because you need a powerful approach to compression to effectively use an expensive memory resource… and columnar provides this. But Teradata, like Greenplum, extended a row-based engine to support columns in order to get to market quickly… they hoped to get 80% of the effectiveness of in-memory with only 20% of the engineering effort. The other 20% comes when you develop a new engine that fully exploits the advantages of a columnar architecture. These advanced exploits allow HANA, DB2 BLU, and Oracle 12c to execute directly on columnar data, thereby avoiding decompression, fully utilizing the processor caches, and allowing sets to be operated on by super-computing vector-processing instructions. In fact, Teradata really applied a 50/20 rule… they gained 50%, maybe only 40%, of the benefits with their columnar and Intelligent Memory features… but it was easy to deploy what is in effect an in-memory cache over their existing relational engine.
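To see what executing directly on compressed columnar data buys, consider this toy sketch with run-length encoding standing in for real columnar compression (the data and the encoding are simplified for illustration):

```python
# A run-length-encoded status column: (value, run_length) pairs.
# A million physical rows compress to three runs.
rle_status = [("OPEN", 400_000), ("CLOSED", 550_000), ("VOID", 50_000)]

# Retrofitted row engine: decompress first, then scan row by row.
def count_closed_decompressed():
    expanded = [v for v, n in rle_status for _ in range(n)]  # 1M values
    return sum(1 for v in expanded if v == "CLOSED")

# Engine that executes directly on compressed data: one comparison per
# run, never materializing the million values.
def count_closed_compressed():
    return sum(n for v, n in rle_status if v == "CLOSED")

assert count_closed_decompressed() == count_closed_compressed()
```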

Please don’t jump to the wrong conclusion here… Intelligent Memory is a strong product. If you were to put hot data in memory, cool data in Teradata-on-SSD-or-Disk, and cold data in Hadoop and manage them as one EDW you could deploy a very cost-effective platform (see here).

Still, Teradata with Intelligent Memory is not likely to compete effectively against HANA, BLU, or 12c for raw performance… so there will be some marketing foam attached and an appeal for Teradata shops to avoid database apostasy and stick with them. You can see some of the foam in the articles below.

A quick aside here… generally a DBMS should win or lose based on price/performance. The ANSI standard makes a product’s features nearly, not completely but nearly, irrelevant. If you cannot win on price/performance then you blow foam. When any vendor starts talking about things like TCO you should grab your wallets… it is an appeal to foaminess to hide a weakness. I’m not calling out Teradata here… this is a general warning that applies to every software vendor.

Intelligent Memory is a smart move. While it may not win in a head-to-head POC… it will be close-ish… close enough to keep the congregation in their pews. As readers know, I am not a big fan of technical religiosity… being a “Teradata shop” is lazy… as engineers we should pick the best solution and learn it. The tiered approach mentioned three paragraphs up is a good solution and non-Teradata shops should be considering it… but Teradata shops should be open to new technology as well. Still, we should pick new technology with sensitivity to the cost of a migration… and in many cases Intelligent Memory will save business for Teradata by getting just close enough to make migration a bad trade-off. This is why it was so smart.

Back to the theme of these posts… Teradata back-tracked on the value of in-memory… and in the process admitted-without-admitting a shortcoming in their architecture. So it goes…

Next we will consider whether you should be building data warehouses on z/OS using DB2 or the DB2 Analytics Accelerator aka Netezza.

References

Database Fog Blog

Other
