Big data analytics is like gold mining: you have to go through tons and tons of dirt to get to a few specks of the good stuff. It’s much the same with data when you take the conventional mining approach using data warehousing, and it can be very costly. It has to be approached as an exercise in progressive analytics, done in stages. Firstly, you need to target the data actually needed, a bit like using sonar to locate blocks in a partitioned manner in the database, or you can look for patterns of usage.
There are two basic options for doing this. You can collect all the events, or you can filter and cleanse them down to the ‘specks of gold.’ If you look at a graph of computational intensity versus data volume, it is apparent that you can’t do very intensive computation effectively in the initial stages, when volumes are highest, so you ‘cleanse’ the data and go straight to the target for very detailed analysis. Therefore, more cleansing gives you a higher level of analytical computation capability.
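The two-stage idea can be sketched in a few lines of Python. This is only an illustration: the event fields (`customer_id`, `type`, `amount`), the set of ‘relevant’ event types and the averaging step are all assumptions made for the example, not part of any particular product.

```python
from typing import Iterable, Iterator

# Hypothetical event shape: a dict with "customer_id", "type" and "amount".
# The event types treated as 'gold' here are assumed for illustration.
RELEVANT_TYPES = {"purchase", "top_up"}


def cleanse(events: Iterable[dict]) -> Iterator[dict]:
    """Stage 1: cheap filtering that drops incomplete or irrelevant events."""
    for event in events:
        if event.get("customer_id") is None:
            continue  # incomplete record
        if event.get("type") not in RELEVANT_TYPES:
            continue  # 'dirt': not needed for the detailed analysis
        yield event


def average_amount(events: Iterable[dict]) -> float:
    """Stage 2: a stand-in for the more expensive, detailed analysis."""
    amounts = [event["amount"] for event in events]
    return sum(amounts) / len(amounts) if amounts else 0.0


raw_events = [
    {"customer_id": 1, "type": "purchase", "amount": 20.0},
    {"customer_id": None, "type": "purchase", "amount": 99.0},
    {"customer_id": 2, "type": "heartbeat", "amount": 0.0},
    {"customer_id": 3, "type": "top_up", "amount": 10.0},
]

gold = list(cleanse(raw_events))
print(len(gold), average_amount(gold))
```

The point of the split is that the cheap filter touches every event, while the costly step only ever sees what survives it.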
This is typically what is done in mining minerals, and it is easy to see how the term ‘data mining’ came about. Mechanisms have to be put in place to manage the whole distillation process, regardless of the size of the business. This involves creating special ‘appliances’ for the task, e.g. what Oracle is doing with Exadata storage servers, taking Sun boxes and loading them with Oracle databases, all tuned for the purpose. Of course, that’s not new. IBM did the same thing years before with the AS/400 platform. Both are basic data file systems.
The alternative is to adopt a clustered processing model, possibly using the cloud, so that as the data comes in it is processed in real-time or online (near real-time) mode, while non-time-critical data is split out to be processed in batch mode. Decisions on how specific data is to be processed are based on the business needs, with specific business rules set accordingly. For example, if charging for a service is dependent on average usage for the current period, or changes according to tier levels being reached, then real-time analysis is required. However, if pricing is based on the previous month’s average then batch mode would suffice.
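As a rough sketch of how such business rules might be written down, the Python below maps the three pricing examples above to a processing mode. The names `PricingBasis`, `Mode` and `MODE_FOR_BASIS` are hypothetical, chosen for this example only.

```python
from enum import Enum


class Mode(Enum):
    REAL_TIME = "real-time"
    BATCH = "batch"


class PricingBasis(Enum):
    # Hypothetical labels for the pricing examples in the text.
    CURRENT_PERIOD_AVERAGE = "current_period_average"
    TIER_THRESHOLD = "tier_threshold"
    PREVIOUS_MONTH_AVERAGE = "previous_month_average"


# Business rules: which processing mode each pricing basis requires.
MODE_FOR_BASIS = {
    PricingBasis.CURRENT_PERIOD_AVERAGE: Mode.REAL_TIME,
    PricingBasis.TIER_THRESHOLD: Mode.REAL_TIME,
    PricingBasis.PREVIOUS_MONTH_AVERAGE: Mode.BATCH,
}


def processing_mode(basis: PricingBasis) -> Mode:
    """Look up the processing mode demanded by a pricing rule."""
    return MODE_FOR_BASIS[basis]


for basis in PricingBasis:
    print(basis.value, "->", processing_mode(basis).value)
```

Keeping the rules in a table rather than in scattered conditionals makes it easier to see, and to cost, every decision that forces data into the real-time path.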
Of course, each business decision of this type has a cost associated with it. The more real-time processing required, the more memory, processing power and bandwidth are needed. Careful determination of the business needs, and subsequent business rule management, can have substantial cost implications. The best way to determine which data needs to be treated in real time or otherwise involves the classification of data as logical data units.
In ‘real-time’ mode the analysis of data takes place during the execution of the transaction, e.g. handling prepaid calls. If the analysis is done after the transaction is completed but its result is used immediately afterwards, it is defined as ‘online.’ Anything processed outside these two stages becomes ‘batch.’
An example of real-time analysis would be when somebody comes onto the site to perform a transaction and you want to analyse their behaviour to determine whether to enable the transaction and whether to modify the offer to the client during the process. If the analysis is done immediately after the transaction is completed, and is used to influence the customer’s next transaction, then it would be ‘online.’ If the analysis is used to determine what plan to propose next time the customer is spoken to, then batch is used. The ability to do these things effectively and cost-efficiently is becoming critical.
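The three definitions reduce to two questions: does the analysis run while the transaction is executing, and if not, is its result needed straight afterwards? A minimal Python sketch follows, with the function and parameter names assumed purely for illustration.

```python
def classify_mode(during_transaction: bool, used_immediately_after: bool) -> str:
    """Classify an analysis as 'real-time', 'online' or 'batch'.

    during_transaction: the analysis runs while the transaction executes.
    used_immediately_after: the analysis runs once the transaction is
        complete, and its result is used straight away.
    """
    if during_transaction:
        return "real-time"
    if used_immediately_after:
        return "online"
    return "batch"


# The three examples from the text, expressed as calls.
print(classify_mode(True, False))   # deciding whether to enable a transaction
print(classify_mode(False, True))   # influencing the customer's next transaction
print(classify_mode(False, False))  # choosing a plan to propose at the next contact
```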
So how best to decide how incoming data is to be processed? In this model, data is classified depending on the channel it comes in on, usually set by the service delivery platform (SDP), or on the attributes it contains. If all data coming in via a specific channel is consistent then it can be processed the same way, as in the case of prepaid calls that need to be processed within 15 milliseconds. Where data has to be analysed for specific attributes, the processing will take longer, but those attributes will help determine how to route the data dynamically and how it will subsequently be processed.
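One way to picture this classification is a router that first tries a fixed per-channel lookup and only falls back to inspecting attributes when the channel alone does not decide the matter. The sketch below is illustrative only: the channel name `prepaid_voice` and the attribute names are invented for the example and do not come from any real SDP.

```python
from typing import Optional

# Hypothetical channel table: channels whose traffic is uniform enough
# to be routed without inspecting each record.
CHANNEL_ROUTES = {
    "prepaid_voice": "real-time",
}


def route_by_attributes(record: dict) -> str:
    """Slower path: inspect (assumed) attributes to pick a processing mode."""
    if record.get("affects_current_charge"):
        return "real-time"
    if record.get("influences_next_transaction"):
        return "online"
    return "batch"


def route(record: dict) -> str:
    """Route by channel when possible, otherwise by attributes."""
    channel_mode: Optional[str] = CHANNEL_ROUTES.get(record.get("channel"))
    if channel_mode is not None:
        return channel_mode  # fast path: no per-record analysis needed
    return route_by_attributes(record)


print(route({"channel": "prepaid_voice"}))
print(route({"channel": "web", "influences_next_transaction": True}))
print(route({"channel": "web"}))
```

The channel lookup is a constant-time check, which matters when a record has to be handled within a tight budget; the attribute path costs more per record but allows routing to change dynamically with the content of the data.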
Real-time analytics with a predictive model is most useful for enabling online interactions and is a goal that most communications service providers (CSPs) are trying to achieve, but it should not be entered into without a clear strategy and an understanding of the costs and benefits.