Data Analysis: Art or Science?

We talk a lot about the science side of data analysis and BI: the calculations and algorithms needed to perform complex queries. Sure, a big part of BI is math, but making sense of data – planning how to structure your analysis at one end, and interpreting the results at the other – is very much an art form, too.
What Is Exploratory Data Analysis?

Exploratory Data Analysis (EDA) is the first step in your data analysis process. Here, you make sense of the data you have, then figure out what questions you want to ask, how to frame them, and how best to manipulate your available data sources to get the answers you need. You do this by taking a broad look at patterns, trends, outliers, unexpected results and so on in your existing data, using visual and quantitative methods to get a sense of the story the data tells. You’re looking for clues that suggest your logical next steps, questions or areas of research.

Developed by John Tukey in the 1970s, exploratory analysis is often described as a philosophy, and there are no hard-and-fast rules for how you approach it. That said, it also gave rise to a whole family of statistical-computing environments, used both to help define what EDA is and to tackle specific tasks such as:
- Spotting mistakes and missing data;
- Mapping out the underlying structure of the data;
- Identifying the most important variables;
- Listing anomalies and outliers;
- Testing a hypothesis / checking assumptions related to a specific model;
- Establishing a parsimonious model (one that can be used to explain the data with minimal predictor variables);
- Estimating parameters and figuring out the associated confidence intervals or margins of error.
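To make two of the tasks above concrete, here is a minimal sketch of a first-pass check for missing data and outliers. It is written in Python with only the standard library, purely for illustration; the sample values and the helper names `count_missing` and `iqr_outliers` are hypothetical, and the 1.5 × IQR rule (Tukey's fences) is one common convention for flagging outliers, not the only one.

```python
from statistics import quantiles

# Hypothetical sample: order values, with None marking a missing entry.
order_values = [120.0, 135.5, None, 128.0, 131.2, 990.0, 125.4, None, 133.9, 129.7]


def count_missing(values):
    """Return how many entries are missing (None)."""
    return sum(1 for v in values if v is None)


def iqr_outliers(values, k=1.5):
    """Return the non-missing values that fall outside Tukey's fences.

    The fences are Q1 - k*IQR and Q3 + k*IQR, where IQR = Q3 - Q1.
    """
    present = [v for v in values if v is not None]
    q1, _, q3 = quantiles(present, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in present if v < low or v > high]


missing = count_missing(order_values)
outliers = iqr_outliers(order_values)
print(f"missing entries: {missing}")
print(f"flagged as outliers: {outliers}")
```

A flagged value is a prompt for a closer look, not proof of an error: it might be a typo, or it might be the most interesting record in the dataset.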
Tools and Techniques

Among the most important statistical programming packages used to conduct exploratory data analysis are S-Plus and R. The latter is a powerful, versatile, open-source programming language that can be integrated with many BI platforms… but more on that in a moment. Specific statistical functions and techniques you can perform with these tools include:
- Clustering and dimension reduction techniques, which help you to create graphical displays of high-dimensional data containing many variables;
- Univariate visualization of each field in the raw dataset, with summary statistics;
- Bivariate visualizations and summary statistics that allow you to assess the relationship between each variable in the dataset and the target variable you’re looking at;
- Multivariate visualizations, for mapping and understanding interactions between different fields in the data;
- K-means clustering (assigning each data point to the cluster whose centre, or mean, is nearest);
- Predictive models, e.g. linear regression.
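As an illustration of the k-means idea in the list above, the following is a rough sketch of the standard iterative procedure (often called Lloyd's algorithm) in plain Python. It is not how R or any BI platform implements it; the function name `k_means`, its parameters and the sample points are all assumptions made for this example.

```python
import math
import random


def k_means(points, k, iterations=20, seed=0):
    """Cluster 2-D points with Lloyd's algorithm; return (centres, labels)."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest centre.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centres[c])) for p in points
        ]
        # Update step: each centre moves to the mean of its assigned points.
        new_centres = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centres.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centres.append(centres[c])  # keep an empty cluster's centre
        if new_centres == centres:
            break  # centres stopped moving, so the clustering has converged
        centres = new_centres
    return centres, labels


# Hypothetical data: two visibly separate groups of points.
points = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (8.0, 8.2), (8.1, 7.9), (7.8, 8.0)]
centres, labels = k_means(points, k=2)
print("centres:", centres)
print("labels:", labels)
```

The two steps inside the loop are the whole technique: assign every point to its nearest centre, then move each centre to the mean of its points, and repeat until nothing changes. On real data the result can depend on the starting centres, which is why production implementations typically run several random starts.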