Deep Data Exploration
Advanced Analytics and Insights Using Python and R
Modern data teams are laser-focused on maximizing the effectiveness of data analysis and the value of the insights that they uncover. Achieving these goals only gets harder as the volume of data escalates and the variety of its sources increases. Success involves data teams exploring data as deeply as possible, extracting and analyzing the best quality data, and employing technologies and techniques that enable them to take the fullest advantage of the data they collect. The key to this success is choosing a platform that supports programming languages that can evolve with the team and enable it to prepare data and create value with new types of analysis.
Platforms that let analysts utilize advanced programming languages, such as Python and R, can do much more than counterparts that rely on SQL alone. They enable users to benefit from new, more sophisticated developments such as machine learning, natural language processing, and the latest advances in data cleaning. Most significantly, they allow users to analyze data more deeply than ever, identify answers to questions that would have been impossible to ask using just SQL, and find insights that weren’t previously achievable.
That’s why Sisense for Cloud Data Teams includes support for Python and R. Integrating these two languages into our product gives companies a much more efficient way to manage their current data processes and new abilities to analyze data at greater depth. In this paper, we’ll discuss some of these new abilities to help your team see the value of incorporating new languages into your data workflow.
The Rise of Machine Learning and Predictive Analytics
Predictive analytics and the importance of clean data
Traditionally, you would use your BI and analytics platform to analyze existing trends, identify backward-looking behaviors and respond to them. But you don’t have to stop there. Machine learning enables you to identify and predict outcomes that can influence your organization’s strategy. This ability is extremely helpful, but it depends heavily on having complete, reliable, “clean” data. Advanced programming languages help you clean your data far better than outdated, static languages, so you can be more confident that you have a strong foundation for making forward-thinking decisions.
Machine learning in Sisense for Cloud Data Teams with Python and R
In order to identify complex patterns in current datasets, you need sophisticated tools, and that’s what Python and R offer. Machine learning really comes into its own when these languages are used to apply algorithms to huge volumes of existing or past data and then extrapolate those patterns onto new inputs to predict potential outcomes: in other words, find trends in existing data and anticipate how they will impact new data.
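As a minimal sketch of that fit-then-predict loop, the snippet below fits a straight line to a handful of made-up historical observations with NumPy and then applies it to inputs the model has not seen. The figures and variable names are illustrative assumptions, and a production model would typically use a dedicated machine learning library and held-out validation data.

```python
import numpy as np

# Hypothetical history: monthly ad spend (in $1,000s) and resulting sign-ups.
ad_spend = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
signups = np.array([12.0, 19.0, 31.0, 42.0, 48.0])

# "Train": fit a degree-1 polynomial (a straight line) by least squares.
slope, intercept = np.polyfit(ad_spend, signups, deg=1)

# "Predict": apply the learned pattern to new inputs.
new_spend = np.array([6.0, 7.0])
predicted_signups = slope * new_spend + intercept

print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print("predicted sign-ups:", np.round(predicted_signups, 1))
```

The same two steps, learning from existing data and then scoring new data, apply whatever the algorithm is. Only the fitting call changes.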
Integrating Python and R into the Sisense for Cloud Data Teams platform gives you expanded data cleaning and analysis abilities that will streamline the creation of new machine learning models. If you can rely on having large, clean datasets, you can be more confident in the trends you predict from that data. The result is added value in the form of more accurate machine learning models. Since the models train on existing data, the benefits of clean data also compound into smarter, faster predictions in the future.
Advanced Statistical Analysis
Organizations are becoming increasingly data-driven to stay ahead of the competition, and they can now do it with the help of predictive analytics, aided by advanced programming languages.
It’s a virtuous circle. Organizations need to go beyond making decisions based exclusively on previous results. They want to confidently predict outcomes and make decisions based on those predictions. SQL alone can’t do this: it’s more of a descriptive language that’s only good at explaining what’s happening. R and Python allow teams to answer questions about why something is happening and what is going to happen next. Those languages can be used to identify patterns that correlate with potential outcomes far more efficiently, often in just a single line. The example below shows the difference between the way SQL determines a correlation between just two variables, and the same type of function written in R, which can analyze many relationships at once in a large matrix of data.
It’s immediately clear why more organizations are adopting R and Python into their data analysis. As they do, advanced analytics is becoming a more prominent part of the decision-making process with the C-Suite, because they are uncovering previously unidentifiable patterns and visualizing them more effectively. Plus, Python and R are dynamic languages that can be refined as more experts use them, with features and capabilities added often, guaranteeing access to the best and most up-to-date tools.
Imagine that your team was tasked with finding ways to minimize customer churn. Using the advanced language support in Sisense for Cloud Data Teams, you could pass a table of data from SQL into R and run correlations across as many variables as you can imagine. Any factors that show a strong enough correlation would be a great starting point for identifying causal relationships with churn.
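The churn scenario can be sketched in Python as well as R. The example below assumes the pandas library and uses a small invented customer table in place of a real SQL result. The column names and values are hypothetical, chosen only to show how one call produces the full correlation matrix.

```python
import pandas as pd

# Hypothetical customer table, standing in for the result of a SQL query.
customers = pd.DataFrame({
    "churned":         [1, 0, 0, 1, 0, 1, 0, 0],
    "support_tickets": [5, 1, 0, 4, 2, 6, 1, 0],
    "monthly_logins":  [2, 20, 25, 3, 15, 1, 18, 30],
    "seats":           [3, 5, 4, 6, 5, 4, 3, 6],
})

# One line computes the pairwise Pearson correlation of every numeric column.
correlations = customers.corr()

# Rank the other variables by the strength of their correlation with churn.
churn_drivers = (
    correlations["churned"]
    .drop("churned")
    .sort_values(key=abs, ascending=False)
)
print(churn_drivers)
```

Variables near the top of the ranking are candidates for further investigation. A strong correlation alone does not show that a factor causes churn.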
To establish causal relations, your team could make educated recommendations about operational adjustments based on those variables. Those hypotheses could be tested for a period and anything that shows a statistically significant improvement can be formally adopted into your company’s process. It’s an easy way to cut through the noise and pinpoint exactly what is driving your business. These new languages give data teams the tools to look for connections that they think are relevant while still allowing data to be the ultimate decision-maker.
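One common way to check whether a tested change produced a statistically significant improvement is a two-proportion z-test. The sketch below uses only the Python standard library. The function name and the churn counts are assumptions made for illustration, and the right test for your team will depend on how the experiment was designed.

```python
import math

def two_proportion_z_test(events_a, n_a, events_b, n_b):
    """Return (z, two-sided p-value) for the difference between two proportions."""
    p_a = events_a / n_a
    p_b = events_b / n_b
    # Pooled proportion under the null hypothesis that both rates are equal.
    pooled = (events_a + events_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical trial: churned customers before (a) and after (b) a process change.
z, p_value = two_proportion_z_test(events_a=120, n_a=1000, events_b=85, n_b=1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
print("significant at the 5% level:", p_value < 0.05)
```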
Advanced statistical analysis in Sisense for Cloud Data Teams
In the Sisense for Cloud Data Teams platform, it’s easy to shift data from SQL into R or Python, where you can analyze and visualize that information before putting it into a dashboard. Those dashboards get refreshed instantly and automatically with the most up-to-date information, so once you’ve built a visualization, it’s always current. This enables everybody within your organization to see the freshest data without needing data scientists to download it or re-run reports. Without needing to repeat routine work, your data professionals are free to move on to new projects and more complex analyses.
Go Beyond Quantitative Data with NLP
Data analysis has traditionally focused on numbers: quantitative data. But there’s a world of other information and insights out there that can be gleaned from other sources, text in particular. It has always been a challenge to capture, organize, analyze, and visualize textual data, and without the right technology, organizations are missing out on the benefits they could create from data of this kind. The wonderful thing about languages like R and Python is that they can overcome this challenge.
The key to unlocking value from text-based data is Natural Language Processing (NLP). It can organize and analyze textual and verbal communication, which is far less structured and sequential than numbers. Python has libraries such as the Natural Language Toolkit (NLTK) that can process speech and writing patterns and incorporate context into the analysis of these patterns, resulting in powerful analysis. It far exceeds what SQL can do with NLP. The possibilities for NLP-based insights are limitless and tools like NLTK will only continue to improve.
Using Python to gain access to insights that would otherwise be unavailable or would involve more cumbersome and time-consuming techniques gives you a huge competitive advantage. A good example is analyzing sentiment, where data teams can identify what is being said about a product or a brand and analyze those statements for meaning. Python can do this quickly and efficiently, obviating the need for a method like focus groups. It also provides more value, because you can analyze an enormous amount of copy and scale analysis wider with little effort.
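To make the idea of sentiment analysis concrete, here is a deliberately simplified sketch that scores text against a tiny hand-made word list using only the Python standard library. It is a stand-in for illustration and not how a library such as NLTK works internally. The word lists, the function name, and the sample reviews are all hypothetical.

```python
import re

# Tiny hand-made lexicon for illustration only; real sentiment tools ship
# much larger, weighted vocabularies.
POSITIVE = {"love", "great", "fast", "helpful", "easy"}
NEGATIVE = {"hate", "slow", "broken", "confusing", "bad"}
NEGATIONS = {"not", "never", "no"}

def sentiment_score(text):
    """Count positive minus negative words, flipping a word that follows a negation."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, token in enumerate(tokens):
        polarity = (token in POSITIVE) - (token in NEGATIVE)
        if i > 0 and tokens[i - 1] in NEGATIONS:
            polarity = -polarity  # crude context handling: "not helpful" is negative
        score += polarity
    return score

reviews = [
    "I love the new dashboard, it is fast and easy to use",
    "The export feature is slow and confusing",
    "Support was not helpful",
]
for review in reviews:
    print(sentiment_score(review), review)
```

Because the scoring is just a function applied to each string, the same loop scales from three reviews to millions of them, which is where the advantage over manual methods comes from.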
Natural Language Processing in Sisense for Cloud Data Teams
One Sisense for Cloud Data Teams customer that is making the most of NLP is Crisis Text Line, a free, anonymous 24/7 text-based crisis intervention system that aims to mitigate crises by connecting people to counselors who are trained to cool down hot moments. They use natural language processing and machine learning to pull insights from their rich data set and identify keywords in texts to help steer a counselor toward a safe resolution. Later, a second phase of this process utilizes a large community of professional counselors to analyze conversations based on common keywords and tags to help assess trends and train counselors to have high-quality conversations with texters.
This innovative approach to predictive modeling allows Crisis Text Line to detect keywords that identify and predict trends in real time. The Crisis Text Line data team uses Sisense for Cloud Data Teams to conduct this complex analysis and quickly visualize the results. In the near future, the team plans to set up a self-service data environment that will empower counselors to access information without help from the data team. This setup would give counselors quicker access to data and ultimately lead to better-informed conversations with texters. Often, the end users have difficulty predicting the needs of texters ahead of time, so a data tool that relies on upfront modeling is ineffective. An agile data environment like Sisense for Cloud Data Teams allows the team of counselors to find answers on their own.
Creating Complex Visuals with Python and R
With all this advanced data comes the challenge of presenting it clearly. You can only realize the potential of deep data analysis if the results are easily understandable by as many people as possible within your organization, particularly the decision-makers who may not have technical know-how. It stands to reason that complex data requires tools that can handle visualizing complex results and concepts.
Traditionally, using simple visuals, analysts have developed charts showing multiple results by creating each layer and then stacking them together into a single visual. It’s a slow and limited process that’s extremely hard to scale efficiently. Python and R expedite this process because they have comprehensive charting libraries that enable you to visualize many results at once and show the interrelations between them. This makes it far easier to do deep analysis and meets a practical customer need.
Some visuals are designed to tell multiple stories, especially the more complex ones. Consider the chart below, which displays the mileage performance of vehicles with different engine types. A first look would illustrate that vehicles with fewer cylinders in the engine would appear to get better overall mileage while driving in the city. But there’s more to this chart: the 8-cylinder engine has a unique shape that needs explanation, the 4-cylinder engine has a long tail while the other two have definite limits and there’s a peculiar bimodal distribution in all three.
This chart can be examined for more findings, but it’s clear that the complete story this data is telling goes deeper than anything that could be derived from simple tables or bar charts.
As growing volumes and types of data have become more rapidly available, customers increasingly want more options to build new charts. With Python and R, you get the capability to develop your own visualizations and dashboards, and customize them however you want, in ways that aren’t limited by a fixed set of options or the need for technical experts to build one-off charts. The Sisense for Cloud Data Teams platform includes over 25 charting libraries to help you customize your visualizations, and the number of options will grow as both languages develop further. This enhances your ability to demonstrate results creatively, clearly, and comprehensibly. Consequently, it makes the results of the data more accessible, so understanding these results is simpler, and the path to achieving more insights gets smoother. Quite literally, we’re giving power to you, the builders. Furthermore, the ability you gain with Python and R to visualize complex data means you now have the capability to tell richer stories than you can with standard, basic data.
Creating more complex visuals in Sisense for Cloud Data Teams
Complex visuals are made easy in R and Python. For example, in R, the ggplot2 package allows very detailed control over the aesthetics of a chart. If you want to map variable_a to the transparency, then alpha=variable_a. If you want to map variable_b to the color, then color=variable_b.
Python also has a lot of customization with visuals through matplotlib. Data teams may prefer to use either language, depending on their background, to visualize data in any way that helps them communicate with their stakeholders. Once the visuals have been created in R or Python, they can be saved directly onto a Sisense for Cloud Data Teams dashboard and will appear beside all the other charts.
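A comparable mapping can be sketched in Python with matplotlib. The example below assumes matplotlib and NumPy are available and uses randomly generated data. As in the ggplot2 description, variable_a drives transparency and variable_b drives color. Everything else (the colormap choice, the labels, the sample size) is an illustrative assumption.

```python
import io

import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: x and y positions plus two extra variables to encode.
rng = np.random.default_rng(seed=0)
x = rng.normal(size=50)
y = rng.normal(size=50)
variable_a = rng.uniform(0.2, 1.0, size=50)   # mapped to transparency
variable_b = rng.uniform(0.0, 10.0, size=50)  # mapped to color

# Map variable_b onto a colormap, then overwrite each point's alpha channel
# with variable_a so that both variables are visible in one chart.
normalized_b = (variable_b - variable_b.min()) / (variable_b.max() - variable_b.min())
point_colors = plt.get_cmap("viridis")(normalized_b)  # RGBA array, shape (50, 4)
point_colors[:, 3] = variable_a

fig, ax = plt.subplots()
points = ax.scatter(x, y, color=point_colors)
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Color encodes variable_b, transparency encodes variable_a")

# Render to PNG bytes; pass a file path instead to save the chart to disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```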
As the saying goes, the proof of the pudding is in the eating. Since integrating Python and R, we’ve already seen customers enthusiastically use them to take their data charts to a new level, extending far beyond typical line, bar, pie charts, and the like, to a whole new variety of visualizations such as heat maps, box and whisker charts, log scales, and much more.
In order to get great insights from your data, you need to be sure that what you’re analyzing is accurate and relevant: in short, “clean”. There’s no point wasting precious time, effort, and resources on “dirty” data such as unnecessary duplication, inaccuracies, or out-of-date information. It’ll only hinder your business and it’s expensive. It’s estimated that dirty data can cost companies as much as 12% of overall revenue, amounting to $3.1 trillion wasted each year in the US alone.
To compound the problem, it has been acknowledged that data scientists spend around 80% of their time preparing and managing data for analysis. Most of this time involves cleaning and organizing data, leaving just a small proportion for analysis and adding value. Besides being arduous and wasteful, it makes data teams disgruntled because 76% of them consider data preparation to be the least enjoyable part of their work.
It’s hugely welcome that advanced coding languages like Python and R can help expedite the data cleaning process since they include packages that enable data teams to perform bulk cleanup. With Sisense for Cloud Data Teams, the data is taken from SQL and passed into one of these languages for editing and bulk cleanup in just a fraction of the time it would normally take.
For example, Python’s re library makes string operations much faster and simpler than using SQL for the same action, dramatically reducing the amount of time and effort that goes into cleaning. Consider a dataset with a lot of missing data. Built-in pandas functions such as fillna and dropna allow data scientists to treat all empty cells in a range the same way. Those cells can be filled with the mean, median, or specific values (fillna) or removed entirely (dropna). Other large-scale cleanup activities like removing duplicates can also be handled with individual lines of code rather than the time-intensive processes that must be used to complete the same task in SQL.
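The cleanup steps described above might look like the following sketch, which assumes pandas and NumPy and uses a small invented table. The column names and values are hypothetical. The calls themselves (drop_duplicates, fillna, dropna, and re.sub) are the standard pandas and Python functions.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical raw export with duplicates, gaps, and inconsistent phone formats.
raw = pd.DataFrame({
    "customer": ["Ann", "Bob", "Bob", "Cy", "Dee"],
    "phone":    ["(555) 010-1234", "555.010.9999", "555.010.9999", None, "555 010 4321"],
    "revenue":  [120.0, np.nan, np.nan, 80.0, 100.0],
})

# Remove exact duplicate rows in one call.
clean = raw.drop_duplicates().copy()

# Fill missing revenue with the column mean (the median or a constant also work).
clean["revenue"] = clean["revenue"].fillna(clean["revenue"].mean())

# Drop rows that still lack a phone number.
clean = clean.dropna(subset=["phone"])

# Use a regular expression to strip everything except digits from phone numbers.
clean["phone"] = clean["phone"].apply(lambda value: re.sub(r"\D", "", value))

print(clean)
```

Each step is a single line that applies to the whole column or table at once, which is what makes bulk cleanup quick to write and easy to review.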
As a result, the data cleaning process becomes much more efficient, so analysts can spend less time and fewer resources on it, and more time doing what they’re good at: namely research and analysis that result in strong insights that add real value to your organization.
Powerful Data Transformations
Once data has been cleaned and organized, you can analyze it and generate insights. In order to maximize the impact of these insights, it’s important that they’re understood by as many people as possible within your organization. This involves the critical process of turning data tables into visualizations, and once again, using Python and R makes the process easier, faster, and more effective than just using SQL.
How Python and R make this easier
In SQL, queries can be run to produce tables, but if you want to create charts, you’ll need to pass these queries into a BI platform like Sisense for Cloud Data Teams. This means data is being prepared in one environment and then visualized in another, and moving the data in this way risks losing some of the formatting in translation. Python and R, on the other hand, are set up for data visualization. For instance, Python’s pandas library enables you to pivot (spread) or melt (gather) data, and R has spread and gather functions in the tidyr library, which make it easy to map, manipulate, and restructure data tables within a single environment. In the two images below, pivoting changes the data from the image on the left to look like the image on the right. Melting does the opposite.
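In Python, the pivot and melt operations can be sketched with pandas as follows. The sales table is invented for illustration, and its column names are assumptions.

```python
import pandas as pd

# Hypothetical long-format table: one row per (region, quarter) pair.
long_table = pd.DataFrame({
    "region":  ["East", "East", "West", "West"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "sales":   [100, 120, 90, 110],
})

# Pivot (spread): one row per region, one column per quarter.
wide_table = long_table.pivot(index="region", columns="quarter", values="sales")

# Melt (gather): back to one row per (region, quarter) pair.
melted = wide_table.reset_index().melt(
    id_vars="region", var_name="quarter", value_name="sales"
)

print(wide_table)
print(melted)
```

The wide layout is usually easier to read and chart side by side, while the long layout is what many plotting and modeling functions expect, so being able to switch in one call is convenient.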
Also, Python speeds up the analysis and plotting of data by converting object datatypes into category types that don’t need anywhere near as much memory to manipulate the data and visualize it.
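The memory saving can be checked directly. This sketch assumes pandas and builds a hypothetical column with many rows but only three distinct values, then compares its footprint before and after conversion to the category type.

```python
import pandas as pd

# Hypothetical column with many rows but only a few distinct string values,
# stored explicitly with the object dtype.
status = pd.Series(["active", "churned", "trial"] * 100_000, dtype="object")

# Converting to the category dtype stores each distinct string once, plus
# a compact integer code per row.
status_category = status.astype("category")

object_bytes = status.memory_usage(deep=True)
category_bytes = status_category.memory_usage(deep=True)
print(f"object:   {object_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")
```

The benefit depends on the data: columns with few distinct values relative to their length gain the most, while a column of mostly unique strings gains little or nothing.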
Restructuring Data in Sisense for Cloud Data Teams
Sisense for Cloud Data Teams makes it easy to reformat data tables in advanced languages. Just run a SQL query to process the dataset, then pass the table into R or Python and use the reshape2/tidyr or pandas libraries to execute the transformation with a simple command. Using just SQL, manipulations like this would take multiple lines of complex transposing code. In some cases, queries that would take 50-100 lines to perform and run in SQL can be managed in R or Python with just a single line.
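The hand-off from SQL to Python can be illustrated outside the platform with a generic sketch. Here an in-memory SQLite database stands in for the data warehouse, which is purely an assumption for the sake of a runnable example and not a description of how Sisense passes query results. The table, columns, and values are hypothetical.

```python
import sqlite3

import pandas as pd

# An in-memory SQLite database stands in for the warehouse (an assumption).
connection = sqlite3.connect(":memory:")
connection.executescript("""
    CREATE TABLE orders (region TEXT, month TEXT, amount REAL);
    INSERT INTO orders VALUES
        ('East', 'Jan', 100), ('East', 'Jan', 50), ('East', 'Feb', 120),
        ('West', 'Jan', 90),  ('West', 'Feb', 110), ('West', 'Feb', 10);
""")

# Step 1: SQL does the initial processing.
query = "SELECT region, month, amount FROM orders WHERE amount > 0"
orders = pd.read_sql_query(query, connection)
connection.close()

# Step 2: a single pandas call reshapes the result into a region-by-month table.
summary = orders.pivot_table(index="region", columns="month", values="amount", aggfunc="sum")
print(summary)
```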
Once the data has been transformed, R and Python offer more advanced charting libraries that can create complex, customized visuals for data teams. When visuals have been created to fit the specifications and preferences of key stakeholders, data teams can pass them directly into Sisense for Cloud Data Teams to be included in shared dashboards.
Deeper Analysis and Enhanced Insights with Python and R in Sisense for Cloud Data Teams
We have seen how integrating Python and R into your BI and analytics platform takes your ability to analyze and visualize data to the next level.
The Sisense for Cloud Data Teams platform’s support for these advanced coding languages opens the door to far more comprehensive and impactful insights and enables you to answer previously unanswerable questions. Even better, it can all be done simply and speedily. It’s a true revolution in analytics.
If you want to see how you can benefit from Sisense supporting Python and R, set up a free trial with us here. And if you have any questions, contact us and one of our experts will reach out to you soon.