A Story of Hadoop Disillusionment…


Here is a true story… fuzzed just a little to disguise the real-life characters…

Three years ago… a friend calls to say: “Our new CxO just informed us that we needed to install a 1000-node Hadoop cluster in the next two months. I said… cool, what is the use case? He says… don’t argue with me… just get 1000 nodes up and running in the next 60 days. I say: there is no floorspace or power for that large a system. He says: do it in the next 60 days!”

My friend then decommissioned several systems that were doing productive, but expendable work, and installed 1000 nodes of Hadoop. And it sat there with no business problem to solve.

Today there is a little work running on the cluster… adding far less value than the expendable work that was decommissioned. The CxO is gone… with a glowing resume that says that he deployed one of the World’s largest Hadoop clusters.

When the hype over a technology gets so amplified that the hypers start hyping about the level of the hype… Hype-squared… you know that disillusionment cannot be far behind.  Gartner is pretty spot on with their Hype Cycle (see here)… but Hadoop may survive, methinks.

Readers… any other good Hadoop hype stories to share?

Getting started with Hadoop… Enhance Your Data Warehouse Eco-system

Gartner thinks that the Big Data hype is going to die down a little for lack of progress… (see here). Companies without web-scale big data are finding it hard to do anything commercially interesting… still, CIOs sense that Hadoop is going to become important. This post provides a suggestion that might help you get started.

Hadoop goes here

In most data warehouse eco-systems there is an area, a staging place, where data lands after it is extracted from the source and before it is transformed. Sometimes the staging area and the ETL process are continuous and data flows through the ETL hardware system without seeming to land… but it usually is written somewhere.

The fact is that enterprises often move only the data that will be consumed by a user query into their data warehouse. Often users want to see only lightly aggregated data, in which case aggregation is part of the ETL process… and the raw detail is lost. A great example comes from the telecommunications space: call details may be aggregated into a call record… and call records are often sufficient to support a telco’s business processes.

But sometimes the detail is important. In this case the staging area needs to become a raw data warehouse… a place where piles of data may be stored inexpensively for a time… possibly for a long time.

This is where Hadoop comes in. Hadoop uses inexpensive hardware and very inexpensive software. It can become your staging area and your raw data warehouse with little effort. In subsequent phases, you can build up a library of the jobs that need to look at raw data. You might even start to build up a series of transformations and aggregations that might eventually replace your ETL system.
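
To make that first phase concrete, here is a toy sketch of the kind of job that might run against the raw staging area… the telco example above, with detail call records rolled up into one record per call. The field layout and the sample data are invented for illustration; a real job would be a Hadoop Streaming script, a Hive query, or a Pig script over files in HDFS, but the map/reduce shape is the same.

```python
# Toy, single-process sketch of an aggregation an early Hadoop job might do over
# raw call detail records (CDRs) in the staging area. The CDR layout
# (call_id, tower, duration_sec, bytes) and the sample rows are invented.
from collections import defaultdict

raw_cdrs = [
    "c100,tower-7,30,12000",
    "c100,tower-9,45,8000",
    "c200,tower-7,10,2500",
]

def map_phase(lines):
    """The 'mapper': emit (call_id, (duration, bytes)) for each detail row."""
    for line in lines:
        call_id, tower, duration, nbytes = line.split(",")
        yield call_id, (int(duration), int(nbytes))

def reduce_phase(pairs):
    """The 'reducer': roll the detail rows up into one record per call."""
    totals = defaultdict(lambda: [0, 0])
    for call_id, (duration, nbytes) in pairs:
        totals[call_id][0] += duration
        totals[call_id][1] += nbytes
    return dict(totals)

call_records = reduce_phase(map_phase(raw_cdrs))
print(call_records)   # {'c100': [75, 20000], 'c200': [10, 2500]}
```

The important point is that the raw detail never leaves Hadoop… only the rolled-up call records need to move on to the warehouse.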

This is what Sears Holdings is up to (see here).

As I suggested in an earlier post, the economics of Hadoop make it the likely repository for big data. Using Hadoop as the staging area for your data warehouse data might provide a low risk way to get started with Hadoop… with an ROI… preparing your staff for other Hadoop things to come…

 

A Look Back at 2012

There seems to be a sort of odd tradition for bloggers to look back at the past year as the New Year starts to unfold. Here is my review of my posts… and some presents.


Top Post

Far and away the most viewed post was Exalytics vs. HANA What are they thinking? It simply notes that these two products are not really comparable, sharing only the descriptor “in-memory”.

My Favorite Post

I liked this the best… ’nuff said: What is Big Data?

OK, here is my 2nd favorite: A Quick Five Minute Rule Update for In-memory Databases, but you probably need to read the prequel first: The Five Minute Rule and In-memory Databases

These papers, and the underlying thinking by folks smarter than I, will inform you about the definition of Hot Data from the point of view of pure IT economics.
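
For the impatient, the heart of the five minute rule is a one-line break-even calculation. Here is a rough sketch of it; every price and device spec below is a placeholder that you should replace with current numbers before drawing conclusions.

```python
# The Gray/Putzolu break-even question: how frequently must a page be
# re-referenced before it is cheaper to keep it in DRAM than to re-read it
# from disk or SSD? All prices and device specs here are placeholders.

page_kb           = 4
pages_per_mb      = 1024 / page_kb     # pages per MB of DRAM
ios_per_sec       = 200.0              # random reads/sec the device sustains
price_per_device  = 300.0              # $ per disk or SSD
price_per_mb_dram = 0.01               # $ per MB of DRAM

# Break-even re-reference interval, in seconds, in the original 1987 form.
breakeven_sec = (pages_per_mb / ios_per_sec) * (price_per_device / price_per_mb_dram)
print("cheaper to keep a page in memory if it is re-read more often than "
      "every %.0f seconds (about %.1f hours)" % (breakeven_sec, breakeven_sec / 3600))
```

Plug in memory, SSD, and disk prices as they fall and the break-even interval for hot data keeps stretching… which is the sort of shift the update post walks through.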

The Most Under-rated Post

This is the post I thought was the most important… as it might strongly influence data warehouse platform buying decisions over the next few years… And it might even influence the stocks you pick: The Future of Hadoop and Big Data DBMSs

Some Other Posts to Read

Here are two posts that informed me:

The Five Minute Rule… This will point you to a Wikipedia article that will point you to the whole series of papers.

What Every Programmer Should Know About Memory… This paper goes into gory detail about how memory works inside a processor. It is hardware-centric for you software folks… but provides the basis for understanding why in-memory DBMSs are fast and why Exadata is not an in-memory DBMS.

And some other Good Stuff

Kevin Closson on Exadata

Google Research

Thank you for your attention last year. I hope that each of you has a safe, prosperous, and happy new year…

– Rob

Microsoft SQL Server Announcements – November 2012

Here is one I composed for SAP on the HANA blog about the recent Microsoft SQL Server announcements; it is not too obnoxiously pro-HANA. It is more about the data architecture required to handle a world where the client is a mobile device and every query must complete sub-second. This, I believe, is where we are headed… taking those BI queries that run in an hour on weak warehouses and improving the response to 10 seconds won’t cut it if your user is on a mobile device… and if the query is customer-facing you will be out of business…

The only way to solve for this is to get lots of silicon between you and your data… and hope that no queries miss the cache… or put it all in-memory.
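
A back-of-the-envelope sketch makes the point; the latencies below are illustrative assumptions, not benchmarks.

```python
# Expected response time when most queries are served from silicon (cache or
# memory) but a few fall through to a disk-based warehouse. All numbers are
# illustrative assumptions.

hit_ms  = 200       # served from the cache / in-memory tier
miss_ms = 30_000    # falls through to the disk warehouse
sla_ms  = 1_000     # the sub-second budget a mobile user will tolerate

for hit_rate in (1.00, 0.99, 0.95, 0.90):
    expected = hit_rate * hit_ms + (1 - hit_rate) * miss_ms
    print("hit rate %3.0f%%: expected %6.0f ms, %4.1f%% of queries blow the %d ms budget"
          % (hit_rate * 100, expected, (1 - hit_rate) * 100, sla_ms))
```

Even at a 99% hit rate, one customer-facing query in a hundred waits half a minute. Putting it all in-memory removes the miss path entirely… which is the architecture argument in the post.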

———

I might have added to the work post that anytime a database vendor pre-announces a product that is due out in 1-2 years, “2014-2015” in this case, it is marketing, not architecture… meant to freeze SQL Server customers in place while Microsoft tries to catch up.

Make sure to have a look at the comments… there is a great link to a Microsoft mouthpiece who suggests that I must have no technical background and that I am a liar. Nice.

“Big Data is Essentially All Data” – B. Devlin

I would like to recommend Barry Devlin’s post here, titled “Big Data is Dead… Long Live All Data”. The post ends with this paragraph:

“All this says to me that big data as a technological category is becoming an increasingly meaningless name.  Big data is essentially all data.  Is there any chance that the marketing folks can hear me?”

I could not agree more. If “big data” is meaningful then, as I have argued, it must be a new thing associated with several newish sources of data that come in large volumes like social media data or sensor data or log data. But the term is so abused that I no longer believe that it is salvageable. Big Data is all data… it is any data…

So of course it is true that business must prepare for it (here), that cloud computing must support it (here), that it is more than just a technology issue (here), that organizations need to be aligned (here)… and so on (note that these are just the four most recent tweets on my feed… I could go on and on). How can this be the driver of new IT spending? How can it be the driver of anything?

The point is that everything that has ever been said about data and data warehousing is being restated as new thinking related to big data. If we measured the information entropy we would find no new information is present.

Big Data is Big Hype… Fuel for Bloggers and Pundits…

 

Teradata, HANA and NUMA

Teradata is circulating a document to customers claiming that the numbers SAP published in its 100TB PoC white paper (here) demonstrate that HANA suffers from scaling issues associated with the NUMA-effect. The document is so annoyingly inaccurate that I have to respond.

NUMA stands for non-uniform memory access. This describes an architecture whereby each processor in a multi-socket system has some very fast local memory accessed directly through a memory bus… but has access to every other processor’s local memory through a “remote” hop over another fast bus. In the case of Intel Xeon servers that other fast bus is known as the QPI bus. “Non-uniform” means that not all memory accesses are equal… a remote access over the QPI bus is slower than an access over the local memory bus.
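
A toy model shows why “non-uniform” matters; the latencies are assumed round numbers for a two-socket box, not measurements of any particular system.

```python
# Toy model of the NUMA-effect on a two-socket server: local accesses go over
# the memory bus, remote accesses take an extra hop over QPI. Latencies are
# assumed round numbers, not measurements.

local_ns  = 80     # local DRAM access
remote_ns = 140    # remote access, one QPI hop away

for remote_fraction in (0.00, 0.25, 0.50, 0.75):
    avg = (1 - remote_fraction) * local_ns + remote_fraction * remote_ns
    print("%3.0f%% remote accesses -> %5.1f ns average, %.2fx slower than all-local"
          % (remote_fraction * 100, avg, avg / local_ns))
```

The mitigations mentioned below essentially amount to pinning threads and partitioning data so that the remote fraction stays near zero.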

The first mistake in the Teradata document is where they refer to the problem as the “SMP Knee Curve”. SMP stands for symmetric multi-processing… an architecture where multiple cores share the same memory bus. The SMP Knee Curve describes the problem when too many cores are contending for the same bus. HANA is not certified to run on an SMP system. The 100TB PoC described above is not run on an SMP system. When describing issues you might expect Teradata to at least associate the issue with the correct hardware architecture.

The NUMA-effect describes problems scaling across the processors within a single node. Those issues can limit the ability to keep adding cores, as memory-locking traffic across the QPI bus slows the system. There are ways to mitigate this problem, though (see here for some examples of how to code around the problem).

Of course HANA, which was built as an in-memory system with NUMA as a target from the start, has these NUMA mitigations built in. In fact, HANA is designed deeper still, using special techniques to keep the processor caches filled and to invoke special-purpose SIMD instructions. HANA is built so close to the hardware that it avoids the wasted cycles where a processor shows busy but is really stalled on a cache miss (in other words, HANA gets more work done on a system at 100% CPU busy than software that merely shows 100% CPU busy). But Teradata chose to ignore this deep integration… or was unaware of these techniques.
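
HANA’s internals are not reproduced here, but the flavor of “keep the caches filled and let the hardware vectorize” can be suggested with a toy columnar scan. NumPy stands in for the contiguous, SIMD-friendly processing a column store does natively; this is an illustration, not HANA code.

```python
# Toy illustration (not HANA code) of why cache-friendly, vectorizable scans of
# a contiguous column beat row-at-a-time processing.
import time
import numpy as np

amounts = np.random.randint(0, 1000, 5_000_000)   # one contiguous column

t0 = time.time()
total_loop = 0
for v in amounts:               # row-at-a-time: poor cache behavior, no SIMD
    if v > 500:
        total_loop += v
t1 = time.time()

total_vec = amounts[amounts > 500].sum()   # a scan the hardware can vectorize
t2 = time.time()

print("row-at-a-time %.2fs, columnar/vectorized %.2fs, same answer: %s"
      % (t1 - t0, t2 - t1, int(total_loop) == int(total_vec)))
```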

Worse still, the problem Teradata calls out… shouts out… is about scaling over 100 nodes in a shared-nothing configuration. The NUMA-effect has nothing at all to do with scale out across nodes. It is an issue within a single node. For Teradata to claim this is silliness at best. It is especially silly since the shared-nothing architecture upon which HANA is built is the same architecture Teradata uses.

The twists Teradata applies to the numbers are equally absurd… but I’ll stop here and hope that the lack of understanding they exhibit in throwing around terms like “SMP Knee Curve” and “NUMA-effect” casts enough doubt that the rest of their marketing FUD will be suspect. Their document is surely not about architecture… it is weak marketing… you can see more here.

Netezza Workload Management

@henryccook made an interesting point regarding Netezza workload management this morning… He suggested that once a SPU is engaged by a snippet, the work must be completed before another snippet can start. To say this another way… a SPU has no OS and cannot save the context of one snippet, start another, and then return.

If this is true, it means that if a long-running snippet starts… say, a full file scan of a fact table with no use of the zone map… then that snippet will lock out other queries until it completes.
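
If the run-to-completion premise is right (and that is exactly the open question below), a toy queueing sketch shows the damage; the run times are invented.

```python
# Toy sketch of the concern: if a SPU runs snippets strictly one at a time,
# short snippets queue behind a long scan. Run times are invented.

snippets = [
    ("full scan, no zone map", 600),   # seconds: the long-running offender
    ("short lookup 1",           2),
    ("short lookup 2",           2),
    ("short lookup 3",           2),
]

clock = 0
for name, run_sec in snippets:         # strict FIFO, run-to-completion
    response = clock + run_sec
    print("%-22s queued %4d s, total response %4d s" % (name, clock, response))
    clock = response
```

With any form of preemption the three lookups would come back in a couple of seconds; without it they wait ten minutes behind the scan.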

This is not a very fine-grained approach to workload management and we would expect it to cause difficulties.

Can anyone confirm that this is true? It feels right from an architectural perspective…

 

Data Science or Data Alchemy


 

I love the tweetness of the @howarddresner posts restated here regarding data science… and the dialog it has started… and I would like to add a twist to the conversation.

 

First a story…

 

I once consulted for a company that was building a marketing service for its clients. Using data it held in-house, it targeted customers with products provided by those clients. The company had a team of data scientists who built the targeting models. The same team built the models and reports that evaluated the effectiveness of the targeting. Somehow their evaluations demonstrated that they were brilliant… producing results that were unprecedented and that completely justified the service.

 

But a close look at the targeting and the evaluation showed that the targeting was weak and that the results were grossly inflated.

 

Karl Popper tells us that science must be falsifiable. But science requires enough rigor that somebody must attempt to falsify any claims.

 

“Data Science” is not science in this respect. It is alchemy… and the shortage of data scientists is twice as bad as you think… because there must be two data scientists for every claim of data discovery… one to discover it and one to test the discovery.
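
A simplified picture of what that second data scientist does: score the discovery on data it never saw. The sketch below is entirely synthetic (the “signals” are random noise) and shows how an in-sample evaluation can look brilliant while a holdout shows nothing is there.

```python
# Synthetic demonstration: with enough candidate "signals" and no holdout, an
# in-sample evaluation flatters the targeting model. All data here is random.
import numpy as np

rng = np.random.default_rng(7)
n_rows, n_signals = 200, 2000
X = rng.normal(size=(n_rows, n_signals))   # candidate targeting signals (pure noise)
y = rng.integers(0, 2, n_rows)             # did the customer respond? (coin flips)

train, hold = slice(0, 100), slice(100, 200)

# "Discovery": pick the signal most correlated with response on the training rows.
corr = np.corrcoef(X[train].T, y[train])[-1, :-1]
best = int(np.argmax(np.abs(corr)))
direction = np.sign(corr[best])

def response_rate(rows):
    targeted = direction * X[rows, best] > 0    # customers the model would target
    return y[rows][targeted].mean()

print("response rate among targeted, in-sample: %.2f" % response_rate(train))
print("response rate among targeted, holdout:   %.2f" % response_rate(hold))
print("overall response rate:                   %.2f" % y.mean())
```

The discovery step will always find something in 2,000 random signals; only the holdout tells you whether it is real.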

 

My father used to say: “Figures never lie… but liars figure.” I am not accusing all data scientists of being as unethical as those in my story… but I worry that under the algorithmic mumbo-jumbo, new versions of the truth emerge that are utterly untested and will often prove inaccurate.

 

 

More on The Future of Hadoop and of Big Data DBMSs

 

First, you should look at Google’s Spanner paper here… this is the next-gen from Google, and once it is embraced by the open source community it will put even more pressure on the big data DBMSs. Also have a look at YARN, the next Map/Reduce… more pressure still…

Next… you can imagine that the conventional database folks will quibble a little with my analysis. Let’s try to anticipate the push-back:

  • Hadoop will never be as fast as a commercial DBMS

Maybe not… but if it is close then a little more hardware will make up the difference… and “free” is hard to beat in price/performance.

  • SSD devices will make a conventional DBMS as fast as in-memory

I do not think so… disk controllers, the overhead of non-memory I/O, and an inability to fully optimize processing for in-memory will make a big difference. I said 50X to be conservative… but it could be 200X… and a 200X performance improvement means each query holds its working memory for 1/200th the time, shrinking the memory needed for in-flight work by 200X (see the sketch below)… so it adds up.

  • The Price of IMDB will always be prohibitive

Nope. The same memory that is in SSDs will become available as primary memory soon, and the price points for SSD-based systems and IMDBs will converge.

  • IMDB won’t scale to 100TB

HANA is already there… others will follow.

  • Commercial customers will never give up their databases for open source

Economics means that you pay me now or you pay me later… companies will do what makes economic sense.
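
On the second push-back above, the “adds up” arithmetic deserves one worked example. One way to make it concrete is Little’s Law: the memory a system must hold for in-flight query working sets is arrival rate × working set per query × time each query holds it, so cutting query time 200X cuts the concurrent footprint 200X at the same query volume. The numbers below are assumptions for illustration.

```python
# Little's Law applied to query working sets: concurrent memory footprint =
# query arrival rate x working set per query x time each query holds it.
# All numbers are assumptions for illustration.

arrival_per_sec = 10          # queries arriving per second
working_set_gb  = 2           # memory each query holds while it runs

for label, seconds_per_query in (("disk-based", 200.0), ("in-memory, 200X faster", 1.0)):
    concurrent = arrival_per_sec * seconds_per_query   # queries in flight
    footprint  = concurrent * working_set_gb           # GB held at any instant
    print("%-24s %6.0f queries in flight, %8.0f GB of working sets held"
          % (label, concurrent, footprint))
```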

The original post on this is here.

Big Data is Important… the Phrase “Big Data” Has Become Meaningless

This week an “industry leader” stood in front of a large IT conference and stated that “big data” is any data volume or data complexity that puts you out of your comfort zone. This is not helpful. It makes the definition of big data subjective and psychological. I can see the cartoon now:

Dilbert: I just loaded some new data…

Freud: How does that make you feel, Dilbert?

Industry leaders are trying to get companies to come to grips with the software, hardware, and staffing/expertise issues related to a new opportunity. The operative word is “new”.

Here is the Google Trend for the term “Big Data”:

Big Data is new… it is NOT any data that makes you feel queasy. People have been uncomfortable with data since computing began.

Big data is about the collection, storage, and analysis of the detailed data that new technology is generating.

The problem is that everyone wants to use the phrase to expound whatever thing they have to sell: every product by every vendor supports big data… and every “industry leader” with every talk needs to include the phrase in the title of their talk and repeat it as many times as possible. So every data warehouse pitch is rehashed as a big data pitch, every data governance, master data management, OLAP, data mining, everything is now big data.

Let’s stop. I Big Data for one, Big Data refuse to Big Data pander to the Big Data boost one gets from Big Data using the phrase to get Big Data attention.

I almost forgot… here is my best previous post on the topic…