The Hype of Big Data

Hype Cycle for Emerging Technologies 2010
Hype Cycle for Emerging Technologies 2010 (Photo credit: marcoderksen)

As preface to this you might check out the definition I suggested for Big Data last week here… – Rob

I left Greenplum in large part because they made their mark in… and then abandoned… the  data warehouse market for a series of big hype plays: first analytics and data science; then analytics, data science, and Hadoop; then they went “all-in”, their words, on Big Data and Hadoop… and now they are part of Pivotal and in a place that no-one can clearly define… sort of PaaS where Greenplum on HDFS is a platform.

It is not that I am a Luddite… I pretend each time I write this blog that I am in tune with the current and future state of the database markets… that I look ahead now and then. I just thought that it was unlikely for Greenplum to be profitable by abandoning the market that made them. At the time I suggested to them an approach that was founded in data warehousing but would let them lead in the hyped plays… and be there in front when, and if, those markets matured.

Now, if we were to define markets in an unambiguous manner:

  • a data warehouse database is primarily accessed through one or more BI tools;
  • an analytic database is primarily accessed through a statistical tool; and
  • Hadoop requires Hadoop;

then I suspect that the vast majority of Greenplum revenues still,  3-4 years after the move away from data warehousing, come from the DW market. It is truly a shame that this is not the focus of their engineering team and their marketeers.

Gartner has called it pretty accurately in their 2013 Hype Cycle for Emerging Technologies here. Check out where Big Data is on the curve and how long until it reaches the mainstream. Worse, here is a drill-down showing the cycle for just Big Data. Look at where Data Science sits and when they expect it to plateau. Look at where SQL for Hadoop is in the cycle.

Big Data is real and upcoming… but there is no concise definition of Big Data… no definition that does not overlap technologies that have been around since before the use of the term. There is no definition that describes a technology that the Fortune 1000 will take mainstream in the next 2-3 years. Further, as I have suggested here and here, open source products like Hadoop will annihilate the commercial market for big analytic databases and squeeze hard the big EDW DBMS players. It is just not a commercially interesting space… and it may not become commercially interesting if open source dominates (unless you are a services company).

Vendors need to be looking hard at Big Data now if they want to play in 2-3 years. They need to be building Big Data integration into their products and they need to be building Big Data apps that take the value straight into the business.

Users need to be looking carefully for opportunities to use Hadoop to reduce costs… and, in highly competitive markets which naturally generate lots of machine-to-machine data, they need to look for opportunities to get ahead of the competition.

But both groups need to understand that they are on the wrong side of the chasm (see here for reference to Crossing the Chasm)… they have to be Early Adopters with a culture that supports an early adopter business model.

We all need to avoid the mistake described in the introduction. We need to find commercially viable spots in an emerging technology play where we can deliver profits and ROI to our organizations. It is not that hard really to see hype coming if you are paying attention… not that hard to be a minor visionary. It is a lot harder to turn hype into profits…

The Big Data Devil

Devil
Devil (Photo credit: Wikipedia)

I just finished a draft for next week on Big Data and thought that with this note I might form a preface…

First… Big Data is about, well…, Big Data. When Gartner devised the three V’s I suspect that they were trying to frame the new stuff that was emerging… not establish a concise definition. So let me be very clear about what I think that Big Data is and is not.

Big Data is about volume, not velocity, not variety. That is what the words “big” and “data” conjoined must mean. Velocity + Volume is Big Data. Variety + Volume is Big Data. By themselves Velocity and Variety are new, important, separate, technological trends.

Next, Big Data is a new thing. It is not a technology that was around in a meaningful way 5+ years ago. It was emerging just then so we should see evidence in the advances offered by the Web Scale companies like Google, Yahoo, and Netflix. It is not any data that was conventionally created, captured, or used before 2010 or so.

So what is new, big, and was emerging in recent history? It is the creation, capture, and use of machine-generated data: click-stream data, system log data, and sensor data. Big Data technology has to do with the creation, capture, and utilization of large volumes of machine-generated data… nothing more or less.

Rob: Big Data legitimately includes Social Data as well as Vitaliy rightfully commented… I’ll post on this soon…

Machines generate data at a very low-level of detail. It is said that the devil is in the detail… and the subject of the next post deals with the notion that in order to make our companies more profitable we must all chase this damnable devil.

PS

I wonder if damnable devil is redundant? Probably, yes.

2nd PS (sort of like 2nd breakfast)

Big Data is not about any and every new technology introduced in the last five years…

Hadoop and the EDW

Squeeze If You Feel Pain
Squeeze If You Feel Pain (Photo credit: Artotem)

Cloudera and Teradata have jointly published a nice paper here that presents an interesting perspective of how Hadoop and an EDW play together. Simply put, Hadoop becomes the staging area for “raw data streams” while the EDW stores data from “operational systems”. Hadoop then analyzes the raw data and shares the results with the EDW. Two early examples provided suggest:

  • Click stream data is analyzed to identify customer preferences that are then shared with the EDW. Note that the amount of data sent from Hadoop to the EDW would be fairly small in this case.
  • Detailed data is stored on Hadoop to build analytic models. The models are then transferred to the EDW to score sales activity data. Note that in this scenario the scored activity detail has to live in Hadoop to perform modeling… but it is unclear why it has to live in the EDW as well. I presume that scoring takes place on the EDW instead of in Hadoop for performance reasons… but maybe the data, the modeling, and the scoring should just take place in Hadoop?

The paper then positions Hadoop as an active archive. I like this idea very much. Hadoop can store archived data that is only accessed once a month or once a quarter or less often… and that data can be processed directly by Hadoop programs or shared with the EDW data using facilities such as Teradata’s SQL-H, or Greenplum‘s External Hadoop tables (not by HAWQ, though… see here), or by other federation engines connected to HANA, SQL Server, Oracle, etc.

But think about the implications on how much data has to stay in your EDW if you archive everything older than 90, or even 180, days to Hadoop. The EDW shrinks significantly and the TCO advantage to your Enterprise will be significant. This is very cool.

There is one item in the paper I disagree with, though… and another statement that I think has a very short shelf-life.

The paper suggests that indexes, materialized views, aggregate join indexes, and other tweaks are what differentiates an EDW. I believe that reliance on these structures make for a fragile EDW where only some queries can run fast. I like Teradata better when it just robustly scans fast and none of these redundant-data tuning artifacts are required (more here and here). Teradata was the original scan-fast DBMS… it is more than capable.

The paper also suggests that an EDW maintains value by including a sophisticated cost-based optimizer that uses data demographic statistics to identify an optimal query execution plan. I agree that Hadoop lacks this now… but there are several projects like Cloudera Impala that will eliminate this gap in the near term.

I believe that if you read between the lines you will see more evidence to support my belief (here) that Hadoop will squeeze the EDW vendors hard… and that the size of a squeezed EDW will then fit in an in-memory database.

Aster Data for a price…

Sharp-shinned Hawk, Accipiter striatus, chromo...
(Photo credit: Wikipedia)

If Greenplum HAWQ does not look promising (see my previous posts on HAWQ here and here) what are the prospects for Teradata Aster Data… which aspires to both replace and/or co-exist with Hadoop for a fee? Teradata+Hadoop maybe… but Teradata+Aster+Hadoop seems like one layer too many… as does Aster+Hadoop.

(OK, I removed the bad “HAWQing” pun in the title… no complaints from readers… it just seemed unfair… – Rob)

HAWQ and Pivotal HD – Is it Hadoop?

Green Plums
Green Plums (Photo credit: camera bag)

First, from a technical standpoint I like the Greenplum-on-HDFS HAWQ offering. It looks like the GP Team replaced XFS with HDFS and added some native support for several HDFS file types. I will say more on this soon.

But I would like to weigh-in on the question raised by HortonWorks here… is HAWQ Hadoop? And I have a question towards the end…

Let me propose an analogy: Hadoop is an eco-system of open source components much like LINUX is an eco-system of open source components. If you think the analogy apt, then HAWQ on HDFS  is not Hadoop any more than Microsoft Internet Explorer on LINUX is LINUX. Hive is open source and part of the Hadoop eco-system… as is Impala. Firefox is open source and part of the LINUX eco-system. HAWQ is not Hadoop.

The HortonWorks link points out that Greenplum is not engaged in the Hadoop eco-system as a contributor. They also quote Greenplum as saying that they have 300 developers working on Hadoop. Well… if HAWQ is part of Hadoop and HAWQ is the Greenplum database on HDFS then they have 300 developers on Hadoop. But if, as I suggested, HAWQ is not Hadoop then the number of Greenplum developers on Hadoop might be less. I bumped into a long-time Greenplum employee at Strata who told me that HAWQ was a skunkworks project with 4-5 developers max. This comes from a credible source… but it is still rumor-quality… so take it with a grain of salt.

The bottom line is that Greenplum has marketed very aggressively. They fuzz the definition of Hadoop to claim their commercial database offering running on Hadoop is therefore “Hadoop”. They fuzz the definition of developers working on Hadoop based on this first fuzz.

But does it matter? Greenplum will read and process data stored in HDFS faster than any other SQL-based engine. That is worth something.

But what is it worth? I’m fairly certain that the Greenplum databases will run faster off of Hadoop on XFS than in Hadoop… maybe significantly faster. So the reason for Greenplum on HDFS is faster SQL access to data in HDFS files.

This leads me to my question. I wonder… were the performance numbers quoted, showing a significant performance advantage over both HIVE and Impala, based on queries executed against the Greenplum proprietary table formats or against the same native HDFS file types read by HIVE? If they ran against Greenplum tables then I wonder what the real apples-to-apples comparison would show? Note that I am not being cynical here… I do not know how the tests were set up… only that Greenplum was fast. But if as I said: ” the reason for Greenplum on HDFS is faster SQL access to data in HDFS files” and the data was in Greenplum file structures accessible only by Greenplum then there is little reason left.

I also wonder if it matters because HIVE and Impala will improve their performance significantly over the next 12-24 months. The sheer amount of human R&D being expended here will allow these SQL engines to catch, or nearly catch, HAWQ in performance. If there is any gap left, the price and the community of open source offerings will defeat HAWQ in the market.

As I have suggested here… there is no apparent commercial opportunity competing against Hadoop at this point. I suggested here that Hadoop would eat Greenplum if they stuck to the analytics space and offered both products… effectively competing with themselves. This new strategy is not likely to work in the medium or long run. Greenplum is, indeed, all-in on Hadoop… but without a winning hand.

March 10: See here for the answers to my questions… – Rob

March 12: See here for a rethink on this subject… – Rob

Will Hadoop Eat Greenplum and Netezza?

If I were the Register I would have titled this: Raging Stuffed Elephant To Devour Two Warehouse Vendors… I love the Register… if you do not read it have a look

This is a post is about the market implications of architecture…

Let us assume that Hadoop matures and finds a permanent place in the market. This is not certain with some folks expressing concern (here) and others boundless enthusiasm (here). So let’s assume… and consider where it might fit.

The SqueezeOne place is in the data warehouse market… This view says Hadoop replaces the DBMS for data warehouses. But the very mature BI/DW market requires a high level of operational integrity and Hadoop is not there yet… it is advancing rapidly as an enterprise platform and I believe it will get there… but it will be 3-4 years. This is the thinking I provided here that leads me to draw the picture in Figure 1.

It is not that I believe that Hadoop will consume the data warehouse market but I believe that very large EDW’s… those over 1PB… and maybe over 500TB will be compelled by the economics of “free” to move big warehouses to Hadoop. So Hadoop will likely move down into the EDW space from the top.

Another option suggests that Big Data will be a platform unto itself. In this view Hadoop will sit beside the existing BI/DW platform and feed that platform the results of queries that derive structure from unstructured data… and/or that aggregate Big Data into consumable chunks. This is where Hadoop sits today.

In data warehouse terms this positions Hadoop as a very large independent analytic data mart. Figure 2 depicts this. Note that an analytics data mart, and a Hadoop cluster, require far less in the way of operational infrastructure… they share very similar technical requirements.Hadoop Along Side

This leads me to the point of this post… if Hadoop becomes a very large analytic data mart then where will Greenplum and Netezza fit in 2-3 years? Both vendors are positioning themselves in the analytic space… Greenplum almost exclusively so. Both vendors offer integrated Hadoop products… Greenplum offers the Greenplum database and Hadoop in the same hardware cluster (see here for their latest announcement)… Netezza provides a Hadoop connector (here). But if you believe in Hadoop… as both vendors ardently do… where do their databases fit in the analytics space once Hadoop matures and fully supports SQL? In the next 3-4 years what will these RDBMSs offer in the big data analytics space that will be compelling enough to make the configuration in Figure 3 attractive?

Unified HadoopI know that today Hadoop cannot do all that either Netezza or Greenplum can do. I understand that Netezza has two positions in the market… as an analytic appliance and as a data mart appliance… so it may survive in the mart space. But the overlap of technical requirements between Hadoop and an analytic data mart… combined with the enormous human investment in Hadoop R&D, both in the core and in the eco-system… make me wonder about where “Big Data” analytic relational databases will fit?

Note that this is not a criticism of the Greenplum RDBMS. Greenplum is a very fine product, one of the best EDW platforms around. I’ll have more to say about it when I provide my 2 Cents… But if Figure 2 describes the end state for analytics in 2-3 years then where is the place for the Figure 3 architecture? If Figure 3 is the end state then I do not see where the line will be drawn between the analytic workload that requires Greenplum and that that will run on Hadoop? I barely can see it now… and I cannot see it at all in the near future.

Both EMC Greenplum and IBM seem to strongly believe in Hadoop… they must see the overlap in functionality and feel the market momentum of Hadoop. They must see, better than most, that Hadoop wins this battle.

A Look Back at 2012

There seems to be a sort of odd tradition for bloggers to look back at the past year as the New Year starts to unfold. Here is my review of my posts and some presents

New Years Eve at Borovets, outside hotel "...
(Photo credit: Wikipedia)

Top Post

Far and away the most viewed post was Exalytics vs. HANA What are they thinking? This simply notes that these two products are not really comparable sharing only the descriptor “in-memory”.

My Favorite Post

I liked this the best… ’nuff said: What is Big Data?

OK, here is my 2nd favorite: A Quick Five Minute Rule Update for In-memory Databases, but you probably need to read the prequel first: The Five Minute Rule and In-memory Databases

These papers and the underlying thinking by smarter folks than I will inform you about the definition of Hot Data from the point of pure IT economics.

The Most Under-rated Post

This is the post I thought was the most important… as it might strongly influence data warehouse platform buying decisions over the next few years… And it might even influence the stocks you pick: The Future of Hadoop and Big Data DBMSs

Some Other Posts to Read

Here are two posts that informed me:

The Five Minute Rule… This will point you to a Wikipedia article that will point you to the whole series of papers.

What Every Programmer Should Know About Memory… This paper goes into gory detail about how memory works inside a processor. It is hardware-centric for you software folks… but provides the basis for understanding why in-memory DBMSs are fast and why Exadata is not an in-memory DBMS.

And some other Good Stuff

Kevin Closson on Exadata

Google Research

Thank you for your attention last year. I hope that each of you has a safe, prosperous, and happy new year…

– Rob

“Big Data is Essentially All Data” – B. Devlin

I would like to recommend you to Barry Devlin’s post here titled “Big Data is Dead… Long Live All Data”. The post ends with the paragraph:

“All this says to me that big data as a technological category is becoming an increasingly meaningless name.  Big data is essentially all data.  Is there any chance that the marketing folks can hear me?”

I could not agree more. If “big data” is meaningful then, as I have argued, it must be a new thing associated with several newish sources of data that come in large volumes like social media data or sensor data or log data. But the term is so abused that I no longer believe that it is salvageable. Big Data is all data… it is any data…

So of course it is true that business must prepare for it (here), that cloud computing must support it (here), that it is more than just a technology issue (here), that organizations need to be aligned (here)… and so on (note that these are just the four most recent tweets on my feed… I could go on and on). How can this be the driver of new IT spending? How can it be the driver of anything?

The point is that everything that has ever been said about data and data warehousing is being restated as new thinking related to big data. If we measured the information entropy we would find no new information is present.

Big Data is Big Hype… Fuel for Bloggers and Pundits…

 

More on The Future of Hadoop and of Big Data DBMSs

 

First, you should look at Google’s Spanner paper here… this is the next-gen from Google and once it is embraced by the open source community it will put even more pressure on the big data DBMSs. Also have a look at YARN the next Map/Reduce… more pressure still…

Next… you can imagine that the conventional database folks will quibble a little with my analysis. Lets try to anticipate the push-back:

  • Hadoop will never be as fast as a commercial DBMS

Maybe not… but if it is close then a little more hardware will make up the difference… and “free” is hard to beat in price/performance.

  • SSD devices will make a conventional DBMS as fast as in-memory

I do not think so… disk controllers, the overhead of non-memory I/O, and an inability to fully optimize processing for in-memory will make a big difference. I said 50X to be conservative… but it could be 200X… and a 200X performance improvement reduces the memory required to process a query by 200X… so it adds up.

  • The Price of IMDB will always be prohibitive

Nope. The same memory that is in SSD’s will become available as primary memory soon and the price points for SSD-based and IMDB will converge.

  • IMDB won’t scale to 100TB

HANA is already there… others will follow.

  • Commercial customers will never give up their databases for open source

Economics means that you pay me now or you pay me later… companies will do what makes economic sense.

The original post on this is here

The Future of Hadoop and of Big Data DBMSs

Image representing Hadoop as depicted in Crunc...

About four years ago Michael McIntire and I were pondering the rise of Hadoop. This blog will share bits of that conversation, provide an update based on the state of Hadoop today, and suggest a future state…

Briefly… we believed that the Hadoop eco-system was building all of the piece-parts of a very large database management system. You could see the basics: a distributed file system in HDFS, a low-level query engine in Map/Reduce with an abstraction in Pig, and the beginnings of optimization, SQL, availability, backup & recovery, etc.

We wondered why this process was underway… why would enterprises go to Hadoop when there were perfectly good relational VLDBs that could solve most of the problems… and where they could not… extending a mature RDBMS would be easier than the giant start-from-scratch of Hadoop.

We saw two reasons for the Hadoop project, process, and progress:

  • Michael pointed out that the RDBMS vendors just did not understand how to price their products on “Big Data” (to be fair that term was not in use then)… if you have 7PB of data, as Michael did… then at the current $35K/TB list price the bill would be $245M. Even if you discounted to $1K/TB the tab would be $7M. The DBMS vendors were giving the big guys a financial incentive to Build instead of Buy… and so Google Built and Yahoo Built and Hadoop emerged.
  • I pointed out that the academic community would support this… the ability to write a thesis based on new work in the DBMS space was becoming harder… but it was possible to sponsor papers that applied DBMS concepts to Hadoop and keep the PhD pipeline filled.

So there was funding, research, and development.

The narrative from here on is my own… Michael is off the hook..

Today Hadoop has a first release in the public domain. Dozens of companies are working to extend the core… some as contributors, some with a commercial interest, many with both incentives. The stack is maturing… and we now easily imagine a day when Hadoop will rival Teradata, Exadata, Netezza, and Greenplum in VLDB performance… with some product maturity and a rich set of features. And if Hadoop gets close and the price is free (or nearly so…) then the price/performance of Hadoop will make it unbeatable for “Big Data”.

In fact, the trigger for writing this piece now was the news a few weeks back from one of our Hadoop partners that HIVE was POC’d against one of the databases mentioned above on a big data problem and came close. The main query ran in 35 minutes on the DBMS and in 45 minutes with HIVE. The end is in sight… and sooner than expected.

What might this mean for the future?

Imagine a market where Hadoop can solve for big data problems… let’s say problems over 500TB just to draw a line… with the same performance as the best RDBMS, in a write-once/read-many use case like a data warehouse… for free. For FREE… plus the cost of the hardware. Hadoop wins… no contest.

Let’s suggest a market from 50TB to 500TB where a conventional RDBMS can out-perform Hadoop by 2X more-or-less… but Hadoop is free… so only applications where the performance matters can pay the price premium.

And let’s suggest a high performance in-memory database (IMDB) market that beats disk-based and SSD-based RDBMS by 50X for a 50% premium (based on new technologies like phase-change memory see here…) and can beat Hadoop by 1000X but at a higher cost.

You can see the squeeze:

  • IMDB will own the high performance market… most-likely in the 100TB and under space…
  • Hadoop will own the big data 500TB+ low-cost market…
  • and the conventional DBMS vendors will fight it out for adequate-performance/medium-priced applications from 100TB to 500TB… with continued pressure from the top and the bottom.

Economics will drive this. The conventional DBMS vendors are moving to SSD’s… which increases their price in the direction of an IMDB… and increases their price/performance in the same good direction. But the same memory in SSD’s will soon be generally available as primary memory. So the IMDB prices and the conventional DBMS prices will converge… but the IMDB products will retain a 50X-100X performance advantage by managing the new memory as memory instead of as a peripheral device. Hadoop may or may not leverage SSD’s… but it will be free.

Squeezed, methinks…