Real-time Analytics and BI: Part 1 – Singing for my Dinner

Several months ago I was invited to a dinner attached to a data science summit… the price being that I had to deliver a five-minute talk… I had to sing for my dinner. The result was this thinking on real-time analytics and the Toyota Prius.

Real-time analytics implies two things:
 
  1. Changes in the data are evaluated continuously; and
  2. The results of the analysis are used or displayed continuously.

In a Toyota Prius we can see two examples of real-time analytics.
 
The first is in the anti-lock braking system. There, data reflecting the pressure on the brake pedal and the rotation of each wheel is sent to a computer that analyzes the inputs and adjusts the brake pressure on each wheel so that all four wheels turn at the same rate and the car stops in a straight line.
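To make the pattern concrete… here is a minimal Python sketch of that kind of closed loop: read the sensors, evaluate, act… with no human involved. The sensor values, the 90% threshold, and the release/apply commands are all invented for illustration… this is not Toyota’s control logic.

```python
# Illustrative sketch only: a simplified closed-loop analytic in the spirit of ABS.
# The sensor values, threshold, and adjustment rule are invented for this example.

def read_sensors():
    """Return pedal pressure and per-wheel rotation rates (stubbed for the sketch)."""
    return {"pedal_pressure": 0.8,
            "wheel_rpm": [410, 412, 300, 411]}   # the third wheel is starting to lock

def adjust_brakes(sensors):
    """Release pressure on any wheel turning much slower than the average."""
    rpm = sensors["wheel_rpm"]
    average = sum(rpm) / len(rpm)
    commands = []
    for wheel, rate in enumerate(rpm):
        action = "release" if rate < 0.9 * average else "apply"
        commands.append((wheel, action))
    return commands

# The data is evaluated continuously and the result is used immediately...
# a real controller would repeat this every few milliseconds.
print(adjust_brakes(read_sensors()))
```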
 
Note that the analytics are real-time and the results are used immediately without human intervention. This is important. It makes little sense to spend the money to capture and analyze data in real-time if the results are not actionable in near-real-time.
 
Think for a moment about the BI systems built over the last 20 years. First we captured and analyzed monthly data… and acted on that data within a 30-day window. Then we increased the granularity of the data to weekly and slightly adjusted the reports to reflect the finer granularity… and acted on the data within 7 days. Then we adjusted the data to daily and acted on the results each day. Then we adjusted the data to hourly and reacted even more quickly. These changes often did not fundamentally change the business processes driven by the data… they just made the processes more sensitive to the fine-grained information.
 
But if the data-driven business process takes ten minutes to complete… for example, if it takes ten minutes for staff to pick inventory, package the results, and load a delivery truck… could there be a return on the expense of developing a continuous, real-time analytic? I think not. There may, however, be ROI associated with a new robotic pick, package, and load process…
 
There is another possibility… If the pick, package, and load sometimes takes ten minutes and sometimes fifteen, then the best solution is to perform the analytics on the current state on demand… when there are resources to support the process. This maximizes the use of the resources without changing the business process.
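Here is a small sketch of what on-demand means in practice… the analytic runs when a resource frees up rather than continuously. The order queue, the worker names, and the trivial “take the oldest order” analytic are assumptions made purely for the example.

```python
# A minimal sketch of the on-demand alternative: evaluate the current state only
# when capacity frees up, not continuously. Orders, workers, and the "analytic"
# (take the oldest order) are invented for illustration.

import queue

orders = queue.Queue()
for o in ["order-17", "order-18", "order-19"]:
    orders.put(o)

def next_assignment():
    """Evaluate the current state at the moment a resource becomes available."""
    return None if orders.empty() else orders.get()

def on_worker_available(worker):
    order = next_assignment()             # the analytic runs here, on demand
    if order is not None:
        print(f"{worker} assigned {order}")

# Triggered by the business event, not by a continuous feed of data changes.
on_worker_available("crew-A")
on_worker_available("crew-B")
```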
 
The point here is that real-time requires a re-think… or at least a deep-think. The business process may have to change significantly to support real-time analytics.
 
The second real-time system in the Prius illustrates the problem. On the dashboard the Prius displays, in real-time, the state of the hybrid gas-electric system. It shows whether the battery is charging or discharging… it shows whether the car is being driven by the electric motor or the internal-combustion engine. It is one of the most beautiful dashboard displays you have ever seen… and executives everywhere must look at it and wonder why they cannot get such a beautiful display of the state of their business… after all… BI dashboards are “the thing”.
 
But the Prius display is useless. There is no action you would take while driving based on this real-time display. From a decision-making view it represents useless and expensive flash (that helps to sell the Prius…).
 
So… approach real-time analytics with a deep-think. Look for opportunities like the anti-lock braking system where real-time analytics can be embedded into automatic business processes. Avoid flashy dashboards that do not present actionable data.
 
In-memory databases (IMDBs) such as SAP HANA, Oracle TimesTen, and VMware SQLFire promise to enable real-time analytics… and this promise is real… the opportunities can and will revolutionize the enterprise over time… but a revolution is not the same old BI at a finer granularity… it is much more significant than that. Heads will roll.

Cloud Computing and Data Warehousing: Part 1 – The Architectural Issues

My apologies… I was playing with the iPad version of WordPress and accidentally published a very rough outline/first draft of this post. I immediately un-published it… but not before subscribers were notified that there was a new post.

I wonder whether data warehousing is suited to operate in the cloud. This was prompted by Paraccel’s venture to deploy on the Amazon EC2 cloud infrastructure. Let’s work through the architectural implications…

Here are the assumptions I’ll take into this exploration:

  1. A shared-nothing architecture is required to scale.
  2. Cloud infrastructure is cost-effective when the infrastructure is under-utilized and workloads can be consolidated to achieve full utilization… and not so cost-effective when the infrastructure is highly utilized. This is because applications can easily share underutilized resources in the Cloud.
  3. Cloud infrastructure is justified when the workload is inconsistent and either CPU or storage requirements fluctuate widely over the business cycle. This is because a Cloud is elastic and can easily flex as the requirements fluctuate. Cloud computing may not be well suited to static workload requirements.

You can probably see where I’m going with this from the assumptions.

In the end I’ll suggest that there is a database architecture that is suited to warehousing and cloud computing… but let me build to that.

Before I start let me also be clear that I am talking about the database infrastructure… not the application/BI infrastructure required for data warehousing. The BI and ETL components are perfectly suited to cloud computing… they reflect a workload that, in general, runs on under-utilized hardware with BI running during the day and ETL running at night. I have suggested this to my current employer… but alas, I am neither King nor a member of Court.

So in Part 1 let me discuss my first two assumptions and the implications… In Part 2 I’ll discuss data warehousing and elasticity… In Part 3 I’ll consider the Paraccel/Amazon collaboration and in Part 4 I’ll wrap up and consider several new things coming that may change the equations.
—————-
I’ll not work too hard to justify my first assumption… I think that it is well understood that a shared-nothing architecture provides the best possible approach to scale out. Google and others use this approach to scale to hundreds of petabytes of data, and Teradata, Greenplum, Netezza, Paraccel, SAP HANA, and others use it in the data warehouse space. Exadata uses a hybrid approach that scales I/O in a shared-nothing-like storage subsystem… but fails to scale as it passes data to the RAC layer (see Kevin Closson here on the subject).

But the implications are significant for our cloud discussion. First, cloud infrastructure is designed to support general client-server or web-server based commercial computing requirements. A shared-nothing database cluster is a specialized infrastructure optimized for database processing. Implementing the specialized problem on the generalized infrastructure is possible, but sub-optimal. Next, cloud computing requires, more or less, a shared storage subsystem. A shared-nothing architecture shares nothing. Implementing a shared-nothing database on a shared storage subsystem is possible, but sub-optimal.

I believe that the second assumption is also pretty straightforward. The primary rationale for cloud computing comes from the recognition that many data centers deployed applications on servers that were not fully utilized. By virtualizing the hardware on a cloud platform the data center could better service the applications with fewer hardware resources and therefore less cost.

So… in order for cloud computing to be a perfect fit we need to observe a data warehouse database workload running on underutilized hardware infrastructure… You might ask yourself… are there underutilized hardware resources upon which my EDW is built? In most cases I believe that the answer will be “no”. Almost every EDW I’ve seen is over-burdened… stretched… with users demanding more and more resources… more data, more users, more queries, and deeper queries driving the resource requirements up exponentially. The database is swamped all day with queries and swamped all night by ETL and reporting tasks.

So let’s end this post by concluding that there is a problematic architectural mismatch between a shared cloud and a shared-nothing implementation… and that if your warehouse database platform is highly utilized then there may be little benefit from implementing a warehouse in the cloud.

See Part 2 here

Exalytics vs. HANA: What are they thinking?

I’ve been trying to sort through the noise around Exalytics and see if there are any conclusions to be drawn from the architecture. But this post is more about the noise. The vast majority of the articles posted by industry analysts that I’ve read suggest that Exalytics is Oracle’s answer to SAP’s HANA. See:

But I do not see it.

Exalytics is a smart cache that holds a redundant copy of aggregated data in memory to offload aggregate queries from your data warehouse or mart. The system is a shared-memory implementation that does not scale out as the size of the aggregates increases. It does scale up by daisy-chaining Exalytics boxes to store more aggregates. It is a read-only system that requires another DBMS as the source of the aggregated data. Exalytics provides a performance boost for Oracle, including for Exadata (remember, Exadata performs aggregation in the RAC layer… when RAC is swamped, Exalytics can offload some of that processing).

HANA is a fully functional in-memory, shared-nothing, columnar DBMS. It does not store a copy of the data… it stores the data. It can be updated. HANA replaces Oracle… it does not speed it up.

I’ll post more on Exalytics… and on HANA… but there is no Exalytics vs. HANA competition ahead. There will be no Exalytics vs. HANA POCs. They are completely different technologies solving different problems with the only similarity being that they both leverage the decreasing costs of RAM to eliminate the expense of I/O to disk or SSD devices. Don’t let the common phrase “in-memory” confuse you.

Predictive Analytics, Event Processing, and Rules… Oh, and in-memory databases…

There is a lot of talk these days about predictive analytics, big data, real-time analytics, dashboards, and active data warehousing. These topics are related in a fairly straightforward way. Further, there are new claims about in-memory database processing that blends these issues into a promise of real-time predictive analytics. Let’s tease the topic apart…

Predictive analytics is really composed of two parts: modeling and scoring.

Modeling requires big data to discover which information from the enterprise predicts some interesting event. Big data is both broad and deep: broad because no one knows which data elements will be predictive… “which elements” has to be discovered in the modeling exercise… and deep because it takes history to detect a trend.

The model that results represents a rule: if the rule’s conditions are true then we can predict the outcome within some statistical boundaries. For example: “if a payment has not been received within 120 days and the customer’s account balance has dropped over 40% in the last year then there is an 83.469271346% chance that the customer will default on the payment.” Note that you can create a rule without predictive modeling and without a statistical boundary… for example: “if payment has not been received within 120 days then the customer will likely default”. We have been creating these heuristic rules since time began… and the point of predictive analytics is to discover rules that are more accurate than heuristics. You might say, then, that a model defines, or predicts with some certainty, an event of interest. The definition may be described as a set of rules.
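As a rough illustration… the two kinds of rules might look like this in code. The 120-day and 40% conditions and the probability come from the example above; the function names and structure are just a sketch, not a real scoring engine.

```python
# Hypothetical code for the two kinds of rules described above. The conditions and
# the probability come from the text; everything else is a sketch.

def modeled_rule(days_since_payment, balance_drop_pct):
    """Rule discovered by predictive modeling: fires with a statistical boundary."""
    if days_since_payment > 120 and balance_drop_pct > 0.40:
        return 0.83469271346      # predicted probability that the customer defaults
    return None                   # the rule does not fire

def heuristic_rule(days_since_payment):
    """Hand-built heuristic: same shape, but no statistical boundary attached."""
    return days_since_payment > 120   # "the customer will likely default"

print(modeled_rule(150, 0.45))    # 0.83469271346
print(heuristic_rule(150))        # True
```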

Scoring requires only the elements discovered by the modeling exercise… and may or may not require big, deep data. Big data is required if any of the discovered elements represents a trend. For example, if predicting a stock price requires elements that represent the average price over the last 30 days, 90 days, 180 days, etc.… and an element that calculates the difference between the 90-day number and the 30-day number to show the trend… then either the data has to be aggregated on-the-fly from the detail or it must be pre-aggregated. This distinction is important… the result of a modeling exercise may require the creation of some new aggregated data. Note that we are suggesting that a score depicts an interesting event.
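A small sketch of the stock-price example… the 30-day and 90-day averages and their difference are computed on the fly from made-up detail data; a production system might instead maintain them as pre-aggregated values.

```python
# Sketch of scoring an element that represents a trend, following the stock-price
# example. The prices are invented; the point is that these aggregates are either
# computed on the fly from the detail (as here) or maintained pre-aggregated.

prices = [10.0 + 0.05 * day for day in range(180)]   # 180 days of detail history

def average_of_last(series, days):
    """Aggregate the most recent `days` observations from the detail data."""
    window = series[-days:]
    return sum(window) / len(window)

avg_30 = average_of_last(prices, 30)
avg_90 = average_of_last(prices, 90)
trend = avg_90 - avg_30              # the derived element the model scores on

print(round(avg_30, 3), round(avg_90, 3), round(trend, 3))
```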

Real-time analytics, or more fairly, near-real-time analytics, requires these rules to be checked ASAP after new information is available.

A dashboard can provide one of two features: either the dashboard applies the rules and presents an alert when some rule triggers… or the dashboard presents raw data for evaluation by a human. For example, the speedometer on your car dashboard presents raw data and it is up to you to apply rules based on the input. Note that the speedometer in your car provides a real-time display. Sometimes BI dashboards use real-time-style displays like a speedometer to present static data. For example, I have seen daily metrics displayed using a speedometer widget… but since the widget updates just once a day this is clearly metaphorical.

Active data warehousing implies some sort of rule-based activity. The activity may be triggered in near-real-time or as a batch process.

But in any consideration of real-time processing there is an issue if the rules cross data input boundaries. By this I mean… it is simpler to build a speedometer that reads from one input, the rotation of the axle, than to build a meter that incorporates multiple inputs… for example the meter that displays how many miles or kilometers you have left before you run out of petrol. To provide this in real-time your car needs real-time access to two inputs… and an embedded processor is required to integrate the data and derive the display. Near-real-time displays of data warehouse data have the same constraints… if the display requires data from more than one source then that data has to be acquired, integrated, and calculated/scored in near-real-time. This is a daunting problem if the sources cross application system boundaries.
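For illustration… here is roughly what the distance-to-empty calculation looks like once the two inputs have been integrated. The units and the arithmetic are assumptions made for the sketch… the point is simply that the derived display needs both feeds and a processing step in between.

```python
# A sketch in the spirit of the distance-to-empty meter: two independent real-time
# inputs must be integrated before anything useful can be displayed. The units and
# the arithmetic are illustrative, not any car maker's actual algorithm.

def distance_to_empty(fuel_litres, litres_per_100km):
    """Combine fuel level and current consumption rate into one derived value."""
    if litres_per_100km <= 0:
        return float("inf")
    return fuel_litres / litres_per_100km * 100.0    # kilometres remaining

# Each argument comes from a different sensor feed; neither alone answers the question.
print(round(distance_to_empty(fuel_litres=22.5, litres_per_100km=6.3)))   # ~357 km
```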

In the book “In-Memory Data Management” Plattner and Zeier promise near-real-time analytics from an in-memory DBMS, HANA. But there is no discussion of how this really works for a data warehouse built on integrated data from multiple source systems. Near-real-time predictive modeling requires broad data that will cross these system boundaries. It may be possible to develop a system where near-real-time data acquisition and integration occur… and where rules are applied immediately to identify interesting events. But this sort of data acquisition is very advanced… and an in-memory database does not inherently solve the problem… every application in the enterprise will not be in the same memory space.

I do not see it. SAP may live in a single memory space… and maybe every possible application of that data can live there as well. But as long as there is relevant data outside of that space, data integration is required and the argument for real-time weakens.
