The Big Data Bang

There is still an open question over whether, after the Big Bang, there is enough mass in the Universe to slow the expansion and cause the universe to contract. While the Big Data Bang continues to expand the universe of bits and bytes… I would like to ask whether some of these numbers are overstated? I know that the sum of the bits and bytes is expanding but I wonder if the universe of information is expanding as much as we claim?

Note that by “information” I mean a unique combination of bits and bytes representing some new information. In other words, if the same information is copied redundantly over and over does that count?

There is a significant growth industry in deduplication software that can backup data without copying redundant information. The savings from these products is astounding. NetApp claims 70% of the unstructured data may be redundant (see here). Data Domain says that eliminating (and compressing) redundant data reduces storage requirements by 10X-30X (see here).  What’s up with that?

In the data warehouse space it is just as bad. The same data lives in OLTP systems, ETL staging areas, Operational Data Stores, Enterprise Data Warehouses, Data Marts, and now Hadoop clusters. The same information is replicated in aggregate tables, indexes, materialized views, and cubes.  If you go into many shops you can find 50TB of EDW data exploded into 500TB of sandboxes for the data scientists to play with. Data is stored in snapshots on an hourly basis where less than 10% of the data changes from hour to hour. There is redundancy everywhere. There is redundancy everywhere. 🙂

I believe that there is a data explosion… and I believe that it is significant… but  there is also a sort of laziness about copying data.

Soon we will see in production the first systems where a single copy of OLTP and EDW and analytic data can reside in the same platform and be shared. It will be sort of shocking to see the Big Data Bang slow a little…

A Big Data Sound Bite…

Here is a sound bite on Big Data I composed for another source…

Big Data is relative. For some firms Big Data will be measured in petabytes and for other in hundreds of gigabytes. The point is that very detailed data provides the vital statistics that quantify the health of your business.

To store and access Big Data you need to build on a scalable platform that can grow. To process Big Data you need a fully scalable parallel computing environment.

With the necessary infrastructure in place the challenge becomes: how do you gauge your business and how do you change the decision-making processes to use the gauges?

Exit mobile version
%%footer%%