Over the past decade, data storage costs have fallen dramatically, allowing businesses to collect ever larger volumes of raw data. For many organizations, this abundance has not translated into better analysis; many are data rich but insight poor. Exploiting large volumes of raw data remains burdensome, complex, and time-consuming because of the cost of scaling both analytical technology and technical skills.
From a business perspective, this makes answering new business questions in a usable timeframe a huge challenge. Experimentation is a key ingredient of effective model development, and it is severely impacted by these time and cost constraints, which ultimately hinders the ability to explore the data being accumulated. Together, these factors prevent many businesses from getting value out of their Big Data sets, leaving them trapped in a constant tradeoff between processing latency, human resources, technological costs, and the requisite accuracy of results.
Tackling the Big Data Puzzle
One constantly evolving approach to reducing the cost of analysis is called “data cognition.” It is a set of techniques that mimic the human mind’s ability to process and understand data, applying machine learning techniques such as neural networks to interpret large volumes of data.
Sisense has been working in stealth mode to develop a cutting-edge data cognition engine we’re calling Sisense Hunch™. Sisense Hunch approximates the results of SQL queries on massive data sets by creating deep neural nets (DNNs) that capture the insights from terabytes of data in models that are only a few megabytes in size. The most exciting part about Sisense Hunch is that query response times stay the same even as data volumes increase, enabling quick querying with high levels of accuracy on even the largest datasets. This eliminates the need for enormous investments in infrastructure and specialized skills when analyzing Big Data.
What Sets Sisense Hunch Apart
To see why Sisense Hunch is such a radical development compared to other query approximation solutions, it is important to understand the limitations of existing approximation techniques. Specifically, these solutions do not scale for large data volumes, only support a few types of analysis, and require constant access to the entire data set.
The first approach to approximation has been to take multiple data samples to map the characteristics of the overall dataset. However, sampling is slow and of questionable use as datasets grow extremely large. For example, a 1% sample of a billion rows is still 10 million rows, and several such samples are needed to be representative.
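To make the scale concrete: a 1% sample of a billion rows is 10 million rows. The sketch below is only an illustration of sampling-based approximation in Python, not a description of any particular product. The function names and the synthetic population are assumptions made for the example.

```python
import random

def sample_size(total_rows, fraction):
    """Number of rows in a simple random sample of the given fraction."""
    return round(total_rows * fraction)

def estimate_mean(values, fraction, rng):
    """Estimate the mean of `values` from a random sample drawn without replacement."""
    k = max(1, round(len(values) * fraction))
    sample = rng.sample(values, k)
    return sum(sample) / k

# Scale of the example in the text: 1% of one billion rows.
rows_needed = sample_size(1_000_000_000, 0.01)

# A small synthetic population stands in for a real table (assumption).
rng = random.Random(0)
population = [rng.gauss(100.0, 15.0) for _ in range(100_000)]

true_mean = sum(population) / len(population)
approx_mean = estimate_mean(population, 0.01, rng)
relative_error = abs(approx_mean - true_mean) / true_mean
```

On a toy population the estimate is cheap and close. The cost problem described above appears when each sample is itself millions of rows and has to be redrawn as the data changes.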
Another approach has been to preprocess vast volumes of queries, but this limits the flexibility to perform new and varied types of analysis. Lastly, there are tools that apply statistical modeling to the data, but these require constant access to the data set to process any changes.
Sisense Hunch overcomes all three barriers to provide an approximation engine that is scalable, supports varied analysis, and can work decoupled from the data set. In fact, the Sisense Hunch model is so lean, it can be installed on any device: a mobile phone, IoT sensor, or even appliances like air conditioners.
The Architecture of a Hunch
Sisense Hunch models are built on recurrent neural networks, which are used throughout training, testing, and deployment. Recurrent neural networks contain loops that allow information to persist and be “memorized” over long sequences of data. This is particularly useful for interpreting and training on SQL queries, which can be long and complex.
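For readers unfamiliar with recurrent networks, here is a minimal sketch of a single recurrent cell in plain Python. It is a generic, untrained Elman-style cell, not the Sisense Hunch architecture. The token vocabulary, weight ranges, and function names are assumptions chosen for illustration. It shows how a hidden state is carried from one token to the next, so the final state depends on the order of the whole sequence.

```python
import math
import random

def init_weights(input_size, hidden_size, rng):
    """Random (untrained) weights for a toy Elman-style recurrent cell."""
    def matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return {
        "w_xh": matrix(hidden_size, input_size),   # input -> hidden
        "w_hh": matrix(hidden_size, hidden_size),  # hidden -> hidden (the loop)
        "b": [0.0] * hidden_size,
    }

def rnn_step(x, h, weights):
    """One step: h_new = tanh(W_xh x + W_hh h + b)."""
    new_h = []
    for i in range(len(h)):
        total = weights["b"][i]
        total += sum(w * xj for w, xj in zip(weights["w_xh"][i], x))
        total += sum(w * hk for w, hk in zip(weights["w_hh"][i], h))
        new_h.append(math.tanh(total))
    return new_h

def run_rnn(sequence, weights):
    """Feed a sequence of input vectors through the cell; return the final state."""
    h = [0.0] * len(weights["b"])
    for x in sequence:
        h = rnn_step(x, h, weights)
    return h

# Hypothetical four-token vocabulary, one-hot encoded.
VOCAB = ["SELECT", "SUM", "FROM", "WHERE"]

def one_hot(token):
    return [1.0 if token == t else 0.0 for t in VOCAB]

weights = init_weights(len(VOCAB), hidden_size=3, rng=random.Random(0))
state_a = run_rnn([one_hot(t) for t in ["SELECT", "SUM", "FROM", "WHERE"]], weights)
state_b = run_rnn([one_hot(t) for t in ["WHERE", "FROM", "SUM", "SELECT"]], weights)
```

The two sequences contain the same tokens in a different order, yet they end in different hidden states. That sensitivity to order is what makes recurrent models a natural fit for query text.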
To start, Sisense Hunch scans the entire dataset and establishes a baseline understanding of the structure of the data and the distribution of its attributes. Based on this initial map, a training dataset is created by generating many aggregated SQL queries and executing them against the data to retrieve their actual results. These SQL queries are encoded as numeric matrices and fed into the recurrent neural network. In this process, the Sisense Hunch model learns the relationship between the structure of these varied queries (what’s in the where clause, the select clause, etc.) and their results. The resulting model approximates new queries accurately enough that at least 85% of predictions fall within 5% of the real result.
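The training-data step described above can be sketched in Python with the standard library’s sqlite3 module. This is a toy illustration under stated assumptions, not Sisense’s implementation: the `sales` table, the one-hot encoding, and the helper names are all hypothetical, and the neural network itself is left out. The last function shows one way to compute the share of predictions that fall within 5% of the real result.

```python
import itertools
import sqlite3

AGGREGATES = ["SUM", "AVG", "COUNT"]
REGIONS = ["north", "south"]

def build_demo_table(conn):
    """Create a tiny table that stands in for the real dataset (assumption)."""
    conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
    rows = [
        ("north", "a", 10.0), ("north", "b", 20.0),
        ("south", "a", 30.0), ("south", "b", 40.0),
    ]
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

def generate_training_pairs(conn):
    """Generate aggregated queries, run them, and pair each with its actual result."""
    pairs = []
    for agg, region in itertools.product(AGGREGATES, REGIONS):
        # `agg` comes from the fixed list above; the filter value is bound safely.
        sql = f"SELECT {agg}(amount) FROM sales WHERE region = ?"
        (result,) = conn.execute(sql, (region,)).fetchone()
        pairs.append(((agg, region), float(result)))
    return pairs

def one_hot(value, vocabulary):
    return [1.0 if value == v else 0.0 for v in vocabulary]

def encode_query(agg, region):
    """Encode a query's structure as a numeric vector (one-hot per clause)."""
    return one_hot(agg, AGGREGATES) + one_hot(region, REGIONS)

def fraction_within(predictions, actuals, tolerance=0.05):
    """Share of predictions whose relative error is at most `tolerance`."""
    hits = sum(
        1 for p, a in zip(predictions, actuals)
        if abs(p - a) <= tolerance * abs(a)
    )
    return hits / len(actuals)

conn = sqlite3.connect(":memory:")
build_demo_table(conn)
pairs = generate_training_pairs(conn)
features = [encode_query(agg, region) for (agg, region), _ in pairs]
targets = [result for _, result in pairs]
```

In a real pipeline, `features` and `targets` would be the inputs and labels for model training, and `fraction_within` would be evaluated on held-out queries the model has not seen.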
Once the queries are defined, a Sisense Hunch model can typically be built and validated within 24 hours. Sisense Hunch models are deployed and made accessible through an API. Deployment is supported either in a cloud infrastructure that can supply the required query processing load and result throughput, or on-premises on servers with GPUs. New samples of data can be passed via the API and analyzed as if they were part of the entire dataset. New data can also be fed into Sisense Hunch as a batch process to continuously build a more accurate model.
Sisense Hunch Advantage and Applications
Sisense Hunch offers distinct benefits. First, Sisense Hunch models can be completely decoupled from the original dataset. This means a model can be used offline or behind a firewall, supporting any type of security infrastructure.
Second, the Sisense Hunch model is orders of magnitude smaller than the actual dataset, compressing terabytes of data into models of a few megabytes. This makes Sisense Hunch particularly mobile and deployable to almost any device or sensor.
Because the Sisense Hunch model is decoupled and compressed, it can be quickly accessed to query the existing dataset or to check the accuracy of new values. This is supported even for unique SQL queries and previously untrained analysis without significant loss in accuracy. Sisense Hunch offers broad support for SQL, including aggregations and where and group-by clauses, and is expanding to support more complex SQL queries.
These factors make Sisense Hunch scalable even as data grows, portable (since it is decoupled and compressed), and agile, as it supports changing analytical queries on the fly. This combination of capabilities is suitable for many use cases such as manufacturing, security, finance, healthcare, and broad model development.
One use case, which we call “real-time edge analytics,” is providing real-time analytics to manufacturing plants. For example, production line machinery could use a Sisense Hunch model to make its own decisions, in real time, on whether to continue a manufacturing process, stop it, or involve a human expert. This level of autonomy can drastically reduce defects and increase yield.
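As a rough illustration of that three-way decision, the Python sketch below maps a predicted defect rate to an action. Everything in it is hypothetical: the `decide` function, the `Action` names, and the threshold values are assumptions for the example, and the predicted rate is taken as given rather than produced by a real model.

```python
from enum import Enum

class Action(Enum):
    CONTINUE = "continue"
    STOP = "stop"
    ESCALATE = "escalate"  # hand the decision to a human expert

def decide(predicted_defect_rate, continue_below=0.02, stop_above=0.10):
    """Map a predicted defect rate to an action using illustrative thresholds."""
    if not 0.0 <= predicted_defect_rate <= 1.0:
        raise ValueError("defect rate must be between 0 and 1")
    if predicted_defect_rate < continue_below:
        return Action.CONTINUE
    if predicted_defect_rate > stop_above:
        return Action.STOP
    # The ambiguous middle band goes to a person rather than the machine.
    return Action.ESCALATE

decisions = [decide(rate) for rate in (0.01, 0.05, 0.25)]
```

Keeping an explicit middle band for escalation is a design choice: the machine acts alone only when the prediction is clearly on one side.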
As another example, any data discovery and exploration process can use Sisense Hunch to speed up feature selection and model building (usually by 4 or 5 orders of magnitude). For instance, a data analyst with a hypothesis about churn can use Sisense Hunch to run thousands of queries on different segments and use the results to validate the hypothesis.
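The kind of segment-by-segment exploration described here can be sketched in Python as follows. The example runs exact queries against an in-memory SQLite table as a stand-in for an approximation model. The `customers` table, its columns, and the helper name are assumptions for illustration only.

```python
import sqlite3

SEGMENT_COLUMNS = ("plan", "tenure_band")

def churn_rate_by_segment(conn, column):
    """Return {segment value: churn rate} for one segmentation column."""
    if column not in SEGMENT_COLUMNS:
        raise ValueError(f"unknown segment column: {column}")
    # Column name is checked against a fixed allow-list before interpolation.
    sql = (
        f"SELECT {column}, AVG(churned) FROM customers "
        f"GROUP BY {column} ORDER BY {column}"
    )
    return dict(conn.execute(sql).fetchall())

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (plan TEXT, tenure_band TEXT, churned INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        ("basic", "new", 1), ("basic", "new", 1),
        ("basic", "old", 0), ("basic", "old", 1),
        ("premium", "new", 0), ("premium", "new", 1),
        ("premium", "old", 0), ("premium", "old", 0),
    ],
)

# One query per segmentation; a real exploration would loop over many more.
rates = {column: churn_rate_by_segment(conn, column) for column in SEGMENT_COLUMNS}

# Hypothesis under test: basic-plan customers churn more than premium ones.
hypothesis_holds = rates["plan"]["basic"] > rates["plan"]["premium"]
```

With exact queries, each segmentation costs a full scan of the data. The appeal of an approximation model is that thousands of such queries can be answered without touching the underlying table.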