The 6 Crucial Steps of Preparing Data for Analysis
If you want to get high quality, actionable insights out of your data, you need to nail every
step in the process – all the way back to the source.
Mistakes in the preparation stage will cost you dearly, making it impossible to extract
genuinely useful, accurate analysis later on.
In this whitepaper, we’ll take you through each stage in detail, showing you how to define,
identify, connect, wrangle, load and verify data drawn from a myriad of sources, ironing
out any issues along the way.
Introduction: What could possibly go wrong?
If you’re in any doubt over the impact that data prep has on your company’s ability to function and profit, just picture London at rush hour.
Until about a decade ago, getting to an unfamiliar place in the city was hell. London has 270 Underground stations arranged across 11 lines, plus several overground train operators, 700 bus routes, and even an on-demand bicycle scheme.
Figuring out how to get from place to place meant staring at a tube map to mentally calculate which combination of lines would be fastest, consulting individual bus and train timetables… and a whole lot of guesswork. Not only was it woefully inefficient, but if any of your information was incomplete or out of date, all your careful planning would be useless.
Without an effective, reliable way to clean, aggregate and harmonize data, you’re like a lost tourist with a big pile of obsolete maps and timetables.
Then, along came Google Maps.
Suddenly, you could bring together real-time data from all sources of transport information, automatically cross-referenced with location data that factored in delays and cancellations, and calculations of how long it takes to walk the parts of your journey that have to be on foot. For the first time, you had an entirely accurate picture of your ideal route – and how long it would take.
Today, organizations the world over are dealing with a migraine-inducing trifecta of more data, more users, and less time.
Whereas data was once collected for the main benefit of regulators, users from all across the organization, including departments like HR and customer support as well as finance and sales, now need it to help them navigate to their most pressing destinations.
A new study from Harvard Business Review Analytic Services reveals a dramatic gap between business leaders’ expectations and realities of using data to achieve their business outcomes. Less than 24% of study participants believe their organizations use data effectively. Yet 94% say it’s essential to their ongoing innovation strategy.
What’s more, the largest companies now rely on an intimidating 36 different data sources on average, many of which are fragmented, siloed, or inconsistent. Unlocking value from this mass of data relies on rock-solid techniques for preparing it for deep-level analysis.
Let’s look at the 6 steps to perfect data prep.
Step 1: Define
Doug Henschen, Principal Analyst at Constellation Research, notes in his latest report that dashboards are the default source for daily, intraday and near-real-time insight, yet the need to switch between dashboards and the transactional and productivity apps where people do their day-to-day work can impede productivity.
In other words, your data projects are driven by specific business goals, and defining those requirements at the outset will help you ensure that the right people have access to the right analytics, in the right format, later on.
You’ll need to consider:
- The main business questions you want to answer
- The KPIs you’ll need to measure
- Who needs access, their technical level, and how they’ll use the data (for example, do they need to apply rules or build their own dashboards?)
- The analytical applications you’ll need to bring the data into later
You will also need to start thinking about how you’ll answer these questions… and specifically, what kind of data you’ll need to work with to do that.
For example, let’s say that you’re trying to figure out what customers really think of a product line, and what is most likely to get them to recommend it to someone else.
The conventional way to approach this is to examine transactional data – what’s being sold, what’s being returned, the number of canceled subscriptions, trends in referral numbers, etc.
This picture is incomplete, though. For a far better idea of what customers are thinking, you would look at what they actually say – whether that’s on social media, blogs, comment sections, etc.
Working with unstructured data, or conducting semantic and location analysis, does of course make things a lot more complicated, as will become clear in the coming steps.
However, including this gives you a far more accurate picture of your data, which in turn informs better decision-making and helps you build successful campaigns that are personalized and relevant. Being prepared for the challenge will save you a lot of time, resources and stress.
Step 2: Identify
Now that you have a broader idea of what you’re up against, you can dive deeper into the data to figure out exactly where what you need is currently stored.
For example, is the data stored in data warehouses or data lakes? Excel? Operational databases like CRMs that export CSV files? Is it coming from external partners, like Facebook Ads? Will you need to use tools like Hadoop to handle these sources in their native formats?
As we touched on in the first step, this goes well beyond transactional data or even external structured data from government or third-party sources. It’s likely that you’ll also have to deal with unstructured, text-based data (documents, social media, customer sentiment, online reviews, etc.), sensor and machine-to-machine data from IoT devices, and even location/geospatial data.
Just as importantly, what breadth of data types are you looking at?
You’ll also need to consider the size of each dataset, whether in terms of rows or megabytes, and whether you need to analyze the whole thing or just a subset.
Generally speaking, choosing to bring in all your data retains granularity but impacts performance – although Sisense gets around this by favoring in-chip over in-memory tools, which limits the strain on your hardware.
Once you have your data sources, you will need to consider how they fit together – or what elements of one table are related to another table.
Get your head around the underlying schema, check you have the credentials to access what you need, and take a good, hard look at this data to figure out what’s suitable for your purposes. Start with a chart or list of all the sources – and what state each one is in. As we’ll see later, you might have a sizeable cleanup operation on your hands before you can extract any value.
Note that relationships between data sets can have widespread implications for the setup you use. That applies whether you’re working with multiple data tables from the same database, or from different ones. In either case, you will need to consider carefully the relationship between the tables you want to work with – and how this sits with your ultimate analysis purposes.
Consider issues like the relative size and detail of each table. Is one a lot bigger than the other? Does one contain a lot more granular detail?
While as a general rule it’s best to ask these questions in advance, before you start setting up your data source, bear in mind that some problems won’t emerge until you begin the process.
There are a few types of relationship scenarios you can work with. One common pattern is the star schema, which is used when you need to enrich an existing table with more information – for example, when you have multiple tables, each adding another level of insight to a central table.
The main table (the Sales Fact table) is related to each of the other tables, but none of those tables are related to each other.
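As a concrete sketch, here is a toy star schema built with Python’s built-in sqlite3 module – the table and column names are invented for illustration. The fact table joins to each dimension table, but the dimensions never join each other:

```python
import sqlite3

# Minimal star schema sketch: one central fact table, two independent
# dimension tables. All names here are hypothetical.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE sales_fact  (sale_id INTEGER PRIMARY KEY,
                          product_id INTEGER, store_id INTEGER, amount REAL);
INSERT INTO dim_product VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO dim_store   VALUES (10, 'London'), (20, 'Leeds');
INSERT INTO sales_fact  VALUES (100, 1, 10, 25.0), (101, 2, 10, 40.0),
                               (102, 1, 20, 25.0);
""")

# The fact table links to each dimension; the dimensions never join each other.
rows = cur.execute("""
    SELECT p.name, s.city, SUM(f.amount)
    FROM sales_fact f
    JOIN dim_product p ON f.product_id = p.product_id
    JOIN dim_store   s ON f.store_id   = s.store_id
    GROUP BY p.name, s.city
    ORDER BY p.name, s.city
""").fetchall()
print(rows)  # [('Gadget', 'London', 40.0), ('Widget', 'Leeds', 25.0), ('Widget', 'London', 25.0)]
```

Because each dimension relates only to the fact table, every query follows the same simple join pattern, which keeps the model easy to reason about.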
By identifying these types of relationships early on in this stage, you will be able to create dashboards with deeper insights into your business functions, giving you a much richer base from which to analyze and make decisions.
Step 3: Connect
Now it’s time to start mapping out how all that data is going to fit together, bringing in only what you need.
If you’re working with many different data sources and tables, you’ll need to model the data in a way that enables users to run ad-hoc queries, and to build and manipulate dashboards and reports, quickly and easily.
This includes finding common fields you can use to merge various datasets together, and deciding how to combine tables and create summary and key tables. Pay close attention as you map out the relationships between the data, too – including one-to-one, one-to-many, and many-to-many relationships.
Relationship types include:
- One-to-One relationship – In this scenario, both sides of the relationship have unique values for every row.
- One-to-Many relationship – In this scenario, one side of the relationship will contain unique values for every row, but the other side of the relationship will contain duplicate values for any or all of the corresponding values in the first table.
- Many-to-Many relationship – In this scenario, both sides of the relationship will hold duplicate values, causing excessive calculations for every query run against it.
There are several methods to resolve and bypass a many-to-many relationship that we will cover in the next section.
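The three scenarios above can be checked mechanically. This sketch – the function name and sample keys are our own, not from any particular tool – classifies a relationship by testing whether the join key is unique on each side:

```python
def classify_relationship(left_keys, right_keys):
    """Classify how two tables relate on a shared key column.

    A side is 'unique' if no key value repeats in it; the combination
    of the two sides determines the relationship type.
    """
    left_unique = len(left_keys) == len(set(left_keys))
    right_unique = len(right_keys) == len(set(right_keys))
    if left_unique and right_unique:
        return "one-to-one"
    if left_unique or right_unique:
        return "one-to-many"
    return "many-to-many"  # duplicates on both sides: avoid where possible

# Hypothetical key columns drawn from two tables:
print(classify_relationship([1, 2, 3], [1, 2, 3]))     # one-to-one
print(classify_relationship([1, 2, 3], [1, 1, 2, 3]))  # one-to-many
print(classify_relationship([1, 1, 2], [1, 2, 2]))     # many-to-many
```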
Note that manually connecting data spread across many different platforms, sources, and tables means navigating each platform’s own internal ways to connect. This means you may need to find several different APIs to bridge these gaps. That said, if you’re using ElastiCubes with built-in Sisense connectors, the process is standardized and handled from one place, removing much of the complexity for you.
Step 4: Wrangle
This stage is about ironing out the problems that become apparent when you start to connect disparate data.
Here, you’ll need to ensure that everything is in the right structure and consolidated so that you can treat it all as a single version of truth. This includes ironing out inconsistencies and errors.
For example, you might need to combine fields like “firstname” and “lastname” into a single field, or standardize conflicting formats (e.g., one table could use the category “USA” and another “United States”).
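The transformations just described can be sketched in a few lines of Python – the field names and the alias table here are hypothetical, chosen only to mirror the example:

```python
# Map of known spellings onto one canonical country label (illustrative).
COUNTRY_ALIASES = {"usa": "United States", "united states": "United States",
                   "u.s.": "United States"}

def wrangle(record):
    cleaned = dict(record)
    # Merge the split "firstname" and "lastname" fields into one "name" field.
    cleaned["name"] = f'{cleaned.pop("firstname")} {cleaned.pop("lastname")}'
    # Standardize conflicting country formats onto the canonical spelling.
    raw = cleaned["country"].strip().lower()
    cleaned["country"] = COUNTRY_ALIASES.get(raw, cleaned["country"].strip())
    return cleaned

row = {"firstname": "Ada", "lastname": "Lovelace", "country": "USA"}
print(wrangle(row))  # {'country': 'United States', 'name': 'Ada Lovelace'}
```

In practice these rules multiply quickly, which is why keeping alias maps and merge logic in one documented place pays off.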
This means checking each source to see if it’s complete, accurate, and up-to-date – which can be very time-intensive, especially if data must be manually transformed or manipulated, or if you want to group the data in new ways. You want to avoid Many-to-Many relationships in particular.
In databases, a Many-to-Many relationship exists when the value in each field used to create a relationship between tables is included multiple times in each table.
The problem with Many-to-Many relationships is that they can cause duplications in the returned datasets, which can lead to incorrect results and consume excessive computing resources.
For example, a hotel may have a table with reservation data and a table with payment data. Both tables include the name of the guest. A guest can have multiple reservations under their name as well as multiple payments recorded in their name. If a relationship exists between the reservation and payment tables based on the guest’s name, a Many-to-Many relationship is created, as the guest’s name appears multiple times in each table.
Almost everyone who has dealt with more advanced data modeling has run into a data or performance issue at some point – and, after a long troubleshooting process, traced the root cause to a Many-to-Many relationship.
Make sure you have a way to test for potential Many-to-Many relationships. Sometimes, running an SQL statement can help you find the different types of relationships so that you can plan workarounds and solutions before you start building your dashboards. In some cases, you can also define alerts to let you know when you have a Many-to-Many relationship. Proactive alerts can prevent poor dashboard performance and wrong result sets, and save many hours of troubleshooting.
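One such SQL check – sketched here against an in-memory SQLite copy of the hotel example, with invented table names – flags any guest who appears more than once in both tables, the signature of a Many-to-Many relationship:

```python
import sqlite3

# Build a tiny copy of the hotel example: reservations and payments,
# both keyed by guest name. All data here is hypothetical.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE reservations (guest TEXT, room TEXT);
CREATE TABLE payments     (guest TEXT, amount REAL);
INSERT INTO reservations VALUES ('Smith', '101'), ('Smith', '202'), ('Jones', '303');
INSERT INTO payments     VALUES ('Smith', 120.0), ('Smith', 90.0), ('Jones', 75.0);
""")

# Any key value duplicated on BOTH sides creates a Many-to-Many relationship.
offenders = cur.execute("""
    SELECT r.guest
    FROM (SELECT guest FROM reservations GROUP BY guest HAVING COUNT(*) > 1) r
    JOIN (SELECT guest FROM payments     GROUP BY guest HAVING COUNT(*) > 1) p
      ON r.guest = p.guest
""").fetchall()
print(offenders)  # [('Smith',)] – 'Smith' appears twice in each table
```

Running a check like this before building dashboards lets you resolve the relationship (for example, via a lookup or key table) rather than debugging inflated numbers later.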
Step 5: Load
This is the point where you move all your cleaned-up, connected data into one centralized location and apply the overarching logic to it there. This secondary environment might be a data mart or an analytical database, depending on your setup.
When choosing where to load your data, consider how up-to-date you need your data to be at all times (in other words, how frequently you will query it), and where data will update from.
For example, if you need to run reports monthly, daily or hourly, you will need the database to sync regularly enough to facilitate this.
One of the most important considerations here is whether you have powerful enough hardware, an effective cloud-hosted solution, or a single-stack BI platform that lets you dip into just the data you need (as Sisense’s ElastiCubes do). This will affect your ability to crunch large datasets or run regular and ad-hoc reports.
You should also consider whether to import and replace the data for the entire table, or to bring in only accumulated data.
Use these general definitions to define the best scenarios for your data:
- Replace Data: Replaces all data at the time of the build. This is recommended for dimension tables – tables that store the attributes describing the objects in a fact table.
- Accumulate Data: Adds new data to the existing table data, without making comparisons and without omitting anything. This is recommended for detailed fact tables – tables that store quantitative information for analysis.
- Ignore When Unchanged: Does not import any data unless changes have occurred in the table schema. This is recommended for summary/snapshot fact tables and for data marts (smaller subsets of data, tailored for specific needs).
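The first two build behaviors above can be sketched as a simple function – the names and logic shown are illustrative, not any specific product’s API:

```python
def build(existing, incoming, mode):
    """Illustrative build step: combine existing table rows with a new batch."""
    if mode == "replace":
        # Dimension tables: overwrite everything with the freshest full snapshot.
        return list(incoming)
    if mode == "accumulate":
        # Fact tables: append new rows, never comparing or dropping data.
        return list(existing) + list(incoming)
    raise ValueError(f"unknown mode: {mode}")

# Hypothetical data: a dimension table rebuilt, a fact table appended to.
dim = build(["Old Store A"], ["Store A", "Store B"], mode="replace")
fact = build([("2024-01", 100)], [("2024-02", 120)], mode="accumulate")
print(dim)   # ['Store A', 'Store B']
print(fact)  # [('2024-01', 100), ('2024-02', 120)]
```

Choosing the right mode per table keeps builds fast: small dimension tables are cheap to rebuild wholesale, while large fact tables only ever grow by the new batch.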
A good example of this in practice is an ElastiCube build, where different combinations of build options and table behaviors determine how the data is imported. Reviewing those combinations can give you an idea of how to structure your own data loads.
Step 6: Verify
Despite the jitters that robots are about to steal all our jobs, this step is proof that there are some things only a human can do (at least, for now)!
Here is where you double check that you haven’t made any mistakes along the way – that something hasn’t gone awry in the data as it’s been connected, transformed and loaded into your analytical system. And for that, you’ll need common sense.
Take something that you know (roughly) the answer to and make sure that the system throws out the right answer when you query the data.
You don’t have to know what the exact answer should be, just that whatever result you’re getting makes logical sense, based on what you already know about the organization. You may want to create a simple pivot table with all the dimensions and fields that you intend to bring in, and make sure it aligns with your expectations.
For example, if you run a report or create a dashboard that suggests the top performers on your sales team have closed no deals this quarter, or that profits have gone up when you know they’ve gone down, something has gone wrong somewhere in the data or the schema – and you’ll have to drill down further to figure out what it was.
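As a sketch of this kind of sanity check – the sales figures and the plausible range here are entirely hypothetical – you can assert that a loaded total lands near a number you already roughly know:

```python
# Hypothetical quarterly totals pulled from the freshly loaded system.
loaded_sales = [("Q1", 1_250_000), ("Q2", 1_190_000), ("Q3", 1_430_000)]

total = sum(amount for _, amount in loaded_sales)

# Suppose finance says sales run at roughly $4–6M a year; a total far
# outside that range means something broke in connect/transform/load.
assert 3_000_000 < total < 7_000_000, f"implausible total: {total}"
print(f"total={total} looks plausible")
```

Automating a handful of such range checks after every load catches schema and join mistakes long before they surface in a dashboard.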
Proper, thorough data preparation right from the start leads to faster, more efficient insights down the line. The steps we’ve outlined here apply to whatever technical setup you are using – and they will get you better results.
That said, you can also streamline the entire process by looking out for a single stack solution that brings as much of this as possible into a single, holistic value chain while reducing TCO at the same time.
If you do opt for this over expensive legacy systems, just make sure you avoid in-memory systems that will either slow you down or sacrifice granularity.
Not only does a well-chosen single stack make it easier to spot issues and manage all your actions from a single platform, it also demands far more accountability from your vendor; they can’t keep passing the buck farther along the stack if something goes wrong!
Want to see how we do it at Sisense? Check out our demo, or start a free trial.