The Hype of Big Data

Hype Cycle for Emerging Technologies 2010
Hype Cycle for Emerging Technologies 2010 (Photo credit: marcoderksen)

As preface to this you might check out the definition I suggested for Big Data last week here… – Rob

I left Greenplum in large part because they made their mark in… and then abandoned… the  data warehouse market for a series of big hype plays: first analytics and data science; then analytics, data science, and Hadoop; then they went “all-in”, their words, on Big Data and Hadoop… and now they are part of Pivotal and in a place that no-one can clearly define… sort of PaaS where Greenplum on HDFS is a platform.

It is not that I am a Luddite… I pretend each time I write this blog that I am in tune with the current and future state of the database markets… that I look ahead now and then. I just thought that it was unlikely for Greenplum to be profitable by abandoning the market that made them. At the time I suggested to them an approach that was founded in data warehousing but would let them lead in the hyped plays… and be there in front when, and if, those markets matured.

Now, if we were to define markets in an unambiguous manner:

  • a data warehouse database is primarily accessed through one or more BI tools;
  • an analytic database is primarily accessed through a statistical tool; and
  • Hadoop requires Hadoop;

then I suspect that the vast majority of Greenplum revenues still,  3-4 years after the move away from data warehousing, come from the DW market. It is truly a shame that this is not the focus of their engineering team and their marketeers.

Gartner has called it pretty accurately in their 2013 Hype Cycle for Emerging Technologies here. Check out where Big Data is on the curve and how long until it reaches the mainstream. Worse, here is a drill-down showing the cycle for just Big Data. Look at where Data Science sits and when they expect it to plateau. Look at where SQL for Hadoop is in the cycle.

Big Data is real and upcoming… but there is no concise definition of Big Data… no definition that does not overlap technologies that have been around since before the use of the term. There is no definition that describes a technology that the Fortune 1000 will take mainstream in the next 2-3 years. Further, as I have suggested here and here, open source products like Hadoop will annihilate the commercial market for big analytic databases and squeeze hard the big EDW DBMS players. It is just not a commercially interesting space… and it may not become commercially interesting if open source dominates (unless you are a services company).

Vendors need to be looking hard at Big Data now if they want to play in 2-3 years. They need to be building Big Data integration into their products and they need to be building Big Data apps that take the value straight into the business.

Users need to be looking carefully for opportunities to use Hadoop to reduce costs… and, in highly competitive markets which naturally generate lots of machine-to-machine data, they need to look for opportunities to get ahead of the competition.

But both groups need to understand that they are on the wrong side of the chasm (see here for reference to Crossing the Chasm)… they have to be Early Adopters with a culture that supports an early adopter business model.

We all need to avoid the mistake described in the introduction. We need to find commercially viable spots in an emerging technology play where we can deliver profits and ROI to our organizations. It is not that hard really to see hype coming if you are paying attention… not that hard to be a minor visionary. It is a lot harder to turn hype into profits…

The Big Data Devil

Devil (Photo credit: Wikipedia)

I just finished a draft for next week on Big Data and thought that with this note I might form a preface…

First… Big Data is about, well…, Big Data. When Gartner devised the three V’s I suspect that they were trying to frame the new stuff that was emerging… not establish a concise definition. So let me be very clear about what I think that Big Data is and is not.

Big Data is about volume, not velocity, not variety. That is what the words “big” and “data” conjoined must mean. Velocity + Volume is Big Data. Variety + Volume is Big Data. By themselves Velocity and Variety are new, important, separate, technological trends.

Next, Big Data is a new thing. It is not a technology that was around in a meaningful way 5+ years ago. It was emerging just then so we should see evidence in the advances offered by the Web Scale companies like Google, Yahoo, and Netflix. It is not any data that was conventionally created, captured, or used before 2010 or so.

So what is new, big, and was emerging in recent history? It is the creation, capture, and use of machine-generated data: click-stream data, system log data, and sensor data. Big Data technology has to do with the creation, capture, and utilization of large volumes of machine-generated data… nothing more or less.

Rob: Big Data legitimately includes Social Data as well as Vitaliy rightfully commented… I’ll post on this soon…

Machines generate data at a very low-level of detail. It is said that the devil is in the detail… and the subject of the next post deals with the notion that in order to make our companies more profitable we must all chase this damnable devil.

PS

I wonder if damnable devil is redundant? Probably, yes.

2nd PS (sort of like 2nd breakfast)

Big Data is not about any and every new technology introduced in the last five years…

IBM BLU and SAP HANA

Weird blue dot (Photo credit: awshots)

As I noted here, I think that the IBM BLU Accelerator is a very nice piece of work. Readers of this blog are in the software business where any feature developed by any vendor can be developed in a relatively short period of time by any other vendor… and BLU certainly moves DB2 forward in the in-memory database space led by HANA… it narrowed the gap. But let’s look at the gap that remains.

First, IBM is touting the fact that BLU requires no proprietary hardware and suggests that HANA does. I do not really understand this positioning? HANA runs on servers from a long list of vendors and each vendor spins the HANA reference architecture a little differently. I suppose that the fact that there is a HANA reference architecture could be considered limiting… and I guess that there is no reference for BLU… maybe it runs anywhere… but let’s think about that.

If you decide to run BLU and put some data in-memory then certainly you need some free memory to store it. Assuming that you are not running on a server with excess memory this means that you need to buy more. If you are running on a blade that only supports 128GB of DRAM or less, then this is problematic. If you upgrade to a 256GB server then you might get a bit of free memory for a little data. If you upgrade to a fat server that supports 512GB of DRAM or more, then you would likely be within the HANA reference architecture set. There is no magic here.

One of the gaps is related: you cannot cluster BLU so the amount of data you can support in-memory is limited to a single node per the paragraphs above. HANA supports shared-nothing clustering and will scale out to support petabytes of data in-memory.

This limit is not so terribly bad if you store some of your data in the conventional DB2 row store… or in a columnar format on-disk. This is why BLU is an accelerator, not a full-fledged in-memory DBMS. But if the limit means that you can get only a small amount of data resident in-memory it may preclude you from putting the sort of medium-to-large fact tables in BLU that would benefit most from the acceleration.

You might consider putting smaller dimension tables in BLU…. but when you join to the conventional DB2 row store the column store tables are materialized as rows and the row database engine executes the join. You can store the facts in BLU in columnar format… but they may not reside in-memory if there is limited availability… and only those joins that do not use row store will use the BLU level 3 columnar features (see here for a description of the levels of columnar maturity). So many queries will require I/O to fetch data.

When you pull this all together: limited available memory on a single node, with large fact tables projecting in and out of disk storage, and joins pushed to the row store you can imagine the severe constraint for a real-world data warehouse workload. BLU will accelerate some stuff… but the application has to be limited to the DRAM dedicated to BLU.

It is only software… IBM will surely add BLU clustering (see here)… and customers will figure out that they need to buy the same big-memory servers that make up the HANA reference architecture to realize the benefits…  For analytics, BLU features will converge over the next 2-3 years to make it ever more competitive with HANA. But in this first BLU release the use of in-memory marketing slogans and of tests that might not reflect a real-world workload are a little misleading.

Right now it seems that HANA might retain two architectural advantages:

  1. HANA real-time support for OLTP and analytics against a single table instance; and
  2. the performance of the HANA platform: where more application logic runs next to the DBMS, in the same address space, across a lightweight thread boundary.

It is only software… so even these advantages will not remain… and the changing landscape will provide fodder for bloggers for years to come.

References

  • Here is a great series of blogs on BLU that shows how joins with the row store materializes columns as rows…

Thinking about BI: Infographics is the next phase…

Infographics by The Guardian (Photo credit: tripu)

I have been thinking about BI… prompted by a friend, Frank Bien, who is the CTO of Looker (you are welcome, Frank, for the plug…) but this post is about a  trend in BI that is worth exploring… and only maybe about Looker or any other tool.

BI was originally about reporting… in its very first iterations users coded SQL directly, or used a 4GL scripting language, and the BI tool was there for formatting the output. Then more focus was put on query-building to make it easier for developers to effectively get the required data.

There was lots of talk at this point about knowledge-workers and do-it-yourself BI… but it never really worked out that way. Business users requested canned reports and went to a query guru to request special reports as required.

The following was pretty normal: after finding some interesting fact in a tabular report a business user would pull the data and build a Powerpoint slide to present the results. As interfaces improved you soon could access the data directly and create Excel or Powerpoint charts without copying the data by hand. In other words data visualization was separated from reporting.

The BI vendors caught on to this, recognizing that data presentation is important, and soon all of the BI tools offered some charting options. But the next step was equally interesting. Charting is a bit of an art… so the BI tools programmed in some directives to help you select the chart that fit your data and that would be visually pleasing. So simple charting as visualization was built-in with some simple assistance to help you with the simple art of presentation.

From here vendors went in two directions, one after the other.

First, dashboards were developed that were customizable and these applications, either semi-static or dynamic, caught your eye. Red lights and limits could be built into a heads-up display. The art of presentation was pretty crude and loudness: bright colors and lots of moving dials and eye-candy won the day. But as far as presentation goes, dashboards are just multiple simple charts arranged on a screen. QlikView led this charge.

Next, a new set of visualization products were rolled out by vendors like Tableau and Pentaho. Users saw that some very powerful pictures could be drawn showing data in a time series… and showing the series changing over time. Since the presentation was more nuanced and more “artistic” the automated assistance required more sophistication and this is where the vendors are now fighting to differentiate themselves.

But an interesting thing was happening outside BI… and this is the point of this note. In the same way that PowerPoint led reporting to charting, a new presentation technique called infographics is emerging. It is the state of the art in data visualization… and Powerpoint… and art… rolled into one. And it is very impactful. I imagine that the next wave of BI tools must embrace this more advanced presentation technique.

Here is how I think this plays out.

The advanced data visualization vendors will provide a palette that directly accesses data, and big data, to allow very custom infographics… you can see some of this at Piktochart although it is more about templates than free form development. But since this is art… tools will be developed that help the art-impaired like me to build nice displays and they will do this by analyzing the data and recommending one or more meaningful infographic displays.

So, maybe in the same way that Powerpoint data presentation anticipated charting in BI tools… Infographics anticipates the next data presentation facility for BI.

I suppose that this is not really a controversial conclusion… and I imagine that if I took the time I could find several start-ups who are way ahead of me on this… but sometimes it’s more fun to daydream it up on my own… and pretend that I’m out front…

Thoughts on Oracle 12c…

Plugs (Photo credit: Brad.K)

 

Here are some quick thoughts on Oracle 12c…

 

First, I appreciate the tone of the announcements. They were sober and smart.

 

I love the pluggable database stuff. It fits into the trends I have discussed here and here. Instead of consolidating virtual machines on multi-core processors and incurring the overhead of virtual operating systems Oracle has consolidated databases into a single address space. Nice.

 

But let’s be real about the concept. The presentations make it sound like you just unplug from server A and plug into server B… no fuss or muss. But the reality is that the data has to be moved… and that is significant. Further, there are I/O bandwidth considerations. If database X runs adequately on A using 5GB/sec of read bandwidth then there better be 5GB/sec of free bandwidth on server B. I know that this is obvious… but the presentations made it sound magic. In addition 12c added heat maps and storage tiering… but when you plug-in the whole profile of what is hot for that server changes. This too is manageable but not magic. Still, I think that this is a significant step in the right direction.

 

I also like the inclusion of adaptive execution plans. This capability provides the ability to change the plan on-the-fly if the execution engine determines that the number of rows it is seeing from a step differs significantly from the estimate that informed the optimizer. For big queries this can improve query performance significantly… and this is especially the case because prior to 12c Oracle’s statistics collection capability was weak. This too has been improved. Interestingly the two improvements sort of offset. With better statistics it is less likely that the execution plan will have to adapt… but less likely does not mean unlikely. So this feature is a keeper.

 

I do not see any of the 12c major features significantly changing Oracle’s competitive position in the data warehouse market. If you run a data warehouse flat-out you will not likely plug it elsewhere… the amount of data to move will be daunting. The adaptive execution plan feature will improve performance for a small set of big queries… but not enough to matter in a competitive benchmark. But for Oracle shops adaptive execution is all positive.

 

HANA Memory Utilization

The current release of HANA requires that all of the data required to satisfy a query be in-memory to run the query. Let’s think about what this means:

HANA compresses tables into bitmap vectors… and then compresses the vectors on write to reduce disk I/O. Disk I/O with HANA? Yup.

Once this formatting is complete all tables and partitions are persisted to disk… and if there are updates to the tables then logs are written to maintain ACIDity and at some interval, the changed data is persisted asynchronously as blocks to disk. When HANA cold starts no data is in-memory. There are options to pre-load data at start-up… but the default is to load data as it is used.

When the first query begins execution the data required to satisfy the query is moved into memory and decompressed into vectors. Note that the vector format is still highly compressed and the execution engine operates on this compressed vector data. Also, partition elimination occurs during this data move… so only the partitions required are loaded. The remaining data is on disk until required.

Let us imagine that after several queries all of the available memory is consumed… but there is still user data out-of-memory on peripheral storage… and a new query is submitted that requires this data. At this point HANA frees enough storage to satisfy the new query and processes it. Note that, in the usual DW case (write-once/read-many), the data flushed from memory does not need to be written back…  the data is already persisted… otherwise HANA will flush any unwritten changed blocks…

If a query is submitted that performs a cartesian product… or that requires all of the data in the warehouse at once… in other words where there is not enough memory to fit all of the vectors in memory even after flushing everything else out… the query fails. It is my understanding that this constraint will be fixed in a next release and data will stream into memory and be processed in-stream instead of in-whole. Note that in other databases a query that consumes all of the available memory may never complete, or will seriously affect all other running queries, or will lock the system… so the HANA approach is not all bad… but as noted there is room for improvement and the constraint is real.

This note should remove several silly arguments leveled by HANA’s competitors:

  • HANA, and most in-memory databases, offer full ACID-compliance. A system failure does not result in lost data.
  • HANA supports more data than will fit in-memory and it pages data in-and-out in a smart fashion based on utilization. It is not constrained to only data that fits in-memory.
  • HANA is not useless when it runs out of memory. HANA has a constraint when there is more data than memory… it does not crash the system… but lets be real… if you page data to disk and run out of disk you are in trouble… and we’ve all seen our DBMS‘s hit this wall. If you have an in-memory DBMS then you need to have enough memory to support your workload… if you have a DB2 system you better not run out of temp space or log space on disk… if you have Teradata you better not run out of spool space.

I apologize… there is no public reference I know of to support the features I described. It is available to HANA customers in the HANA Blue Book. It is my understanding that a public version of the Blue Book is being developed.

Database Computing is Supercomputing… Some external reading: May 2013

Superman: Doomsday & Beyond (Photo credit: Wikipedia)

I would like to recommend to you John Appleby’s post  here on the HANA blog site. While the title suggests the article is about HANA, in fact it is about trends in computing and processors… and very relevant to posts here past, present, and upcoming…

I would also recommend Curt Monash’s site. His notes on Teradata here mirror my observation that a 30%-50% performance boost per release cycle is the target for most commercial databases… and what wins in the general market. This is why the in-memory capabilities offered by HANA and maybe DB2 BLU are so disruptive. These products should offer way more than that… not 1.5X but 100X in some instances.

Finally I recommend “What Every Programmer Should Know About Memory” by Ulrich Drepper here. This paper provides a great foundation for the deep hardware topics to come.

Database computing is becoming a special case, a commercial case, of supercomputing… high-performance computing (HPC) to those less inclined to superlatives. Over the next few years the differentiation between products will increasingly be due to the use of high-performance computing techniques: in-memory techniques, vector processing, massive parallelism, and use of HPC instruction sets.

This may help you to get ready…

The Fog is Getting Thicker…

San Francisco in fog (Photo credit: Wikipedia)

I renamed this so that Teradata folks would not get here so often… its not really about Intelligent Memory… just prompted by it. The post on Intelligent Memory is here. – Rob

Two quick comments on Teradata’s recent announcement of Intelligent Memory.

First… very very cool. More on this to come.

Next… life is going to become very hard for my readers and for bloggers in this space. The notion of an in-memory database is becoming rightfully blurred… as is the notion of column store.

Oracle blurs the concepts with words like “database in-memory” and “hybrid column compression” which is neither an in-memory database or a column store.

Teradata blurs the concept with a strong offering that uses DRAM as a block-IO device (like the old RAM-disks we used to configure on our PCs).

Teradata and Greenplum blur the idea of a column store by adding columnar tables over their row store database engines.

I’m not a fan of the double-speak… but the ability of companies to apply the 80/20 rule to stretch their architectures and glue on new advanced technologies is a good thing for consumers.

But it becomes very hard to distinguish the products now.

In future blogs I’ll try to point out differences… but we’ll have to go a little deeper into the Database Fog.

How Good Is Teradata’s Intelligent Memory?

A 30 feet chunk of the cliff below the apartment building fell to Pacific Ocean. (Photo credit: Wikipedia)

Jason asked a great question in the comment section here… he asked… does Teradata’s Intelligent Memory erode HANA’s value proposition?  Let me answer here in a more general way that is applicable to the general database space…

Every time a vendor puts more silicon between the CPU and the disk they will improve their performance (and increase their price). Does this erode HANA’s value proposition? Sure. Every advance by any vendor erodes every other vendor’s position.

To win business a new database product has to be faster than the competition. In my experience you have to be at least 30% faster to unseat the incumbent. If you are 50% faster you will win a lot of business. If you are 2x, 100%, faster you win nearly every time.

Therefore the questions are:

  • Did the Teradata announcement eliminate a set of competitors from reaching these thresholds when Teradata is the incumbent? Yup. It is very smart.
  • Does Intelligent Memory allow Teradata to reach these thresholds when they compete against another incumbent. Yup.
  • Did it eliminate HANA from reaching these thresholds when competing with Teradata? I do not think so… in fact I’m pretty sure it is not the case… HANA should still be way over the 2x threshold… but the reasons why will require a deeper dive… stay tuned.

In the picture attached a 30 foot chunk eroded… but Exadata still stands. Will it be condemned?

Note: Here is a commercial post on the SAP HANA blog site that describes at a high level why I think HANA retains a distinct architectural advantage.

Memory Trends and HANA

If the Gartner estimates here are correct… then DRAM prices will fall 50% per year per year over the next several years… and then in 2015 non-volatile RAM (see the related articles below) will become generally available.

It has been suggested that memory prices will fall slower than data warehouses will grow (see here). That does not seem to be the case… and the combination of cheaper memory and then non-volatile memory will make in-memory databases like SAP HANA ever more compelling. In fact, as I predicted… and to their credit, Teradata is adding more memory (see here).

Related articles

Exit mobile version
%%footer%%