We live in a world of data: there’s more of it than ever before, in a ceaselessly expanding array of forms and locations. Dealing with Data is your window into the ways Data Teams are tackling the challenges of this new world to help their companies and their customers thrive.
In recent years we’ve seen data become vastly more available to businesses. This is mostly due to the rise of data warehouses, decrease in cost for data storage, and modern ETL tools that make storing and accessing data more approachable than ever before. This has allowed companies to become more and more data-driven in all areas of their business. In fact, being data-driven has become ubiquitous and imperative to survival in today’s climate. This article will discuss at a high level how modern businesses are leveraging new technology to ingest a wider variety of data sources.
As of 2019, according to Blissfully’s 2020 SaaS trends report, smaller companies (0-50 employees) use an average of 102 SaaS applications, while mid-sized ones (101-250 employees) use about 137 SaaS applications. Enterprises, meanwhile, can have as many as 288 SaaS applications in use across the entire organization!
These SaaS products range from CRMs to marketing automation apps to billing software to help desk tools, etc. These applications can log thousands to millions of rows of data made available via API, much of which can hold valuable insight for businesses. Add to this the company’s own internally-gathered product data and it’s clear that even a small or medium-sized business can do a lot with the data at its disposal. In this article, we’ll cover what the ETL process is, why it’s important to powering data-driven businesses, and how the rise of modern ETL tools has changed the ETL process forever.
The ETL process: Foundation of data-driven businesses
Every company wants every team within their business to make smarter, data-driven decisions. Customer support teams look at trends in support tickets or do text analysis on conversations to understand where they can provide better onboarding and documentation. Marketing teams want better visibility into their ad performance across different platforms and the ROI on their spend. Product and engineering teams dig into productivity metrics or bug reports to help them better prioritize their resources.
The ETL process empowers all these different teams to get the information they need to understand and perform their jobs better. Businesses ingest data from a wide array of sources through the ETL process, which stands for Extract, Transform, Load. The prepared data is then available for analysis and use by the various teams who need it, as well as for advanced analysis, embedding into applications, and use for other data monetization efforts. Whatever you want to do with data, you have to ETL it first.
Explaining the traditional ETL process
Traditionally, the extraction step of the process involves batch-pulling the data from the source via files, application database, or an API. In the transformation step, the data is cleaned and modeled so that it’s fit for analysis and ingestible by the data warehouse. Finally, the cleaned tables are loaded into the data warehouse for analysis.
This traditional ETL process is very difficult to scale. It often requires full-time data engineers to develop and maintain the scripts that keep the data flowing. This is because the data providers often make changes to their schemas or APIs, which then break the scripts that power the ETL process. Every time there’s a change, the data engineers scramble to update their scripts to accommodate them, resulting in downtime. With businesses now needing to ingest data from so many disparate (often fast-moving) data sources, maintaining ETL scripts for each one is not scalable.
Better living through ETL tools: The modern ETL process
The modern ETL process follows a slightly different order of operations, dubbed ELT. This new process arose as a result of the introduction of tools to update the ETL process, as well as the rise of modern data warehouses with relatively low storage costs.
Today, ETL tools do the heavy lifting for you. They have integrations for many of the major SaaS applications, and have teams of engineers who maintain those integrations, taking the pressure off of your in-house data team. These ETL tools are built to connect to most major data warehouses, allowing businesses to plug in their apps on one end and their warehouse on the other, while the ETL tools do the rest.
Users can usually control orchestration via a simple drop-down selections within the apps, alleviating the need to stand up your own servers or EC2 box or building DAGs to run on platforms like Airflow. ETL tools can also typically offer more robust options for appending new data incrementally, or only updating new and modified rows, which can allow for more frequent loads, and closer to real-time data for the business. With this simplified process for making data available for analysis, data teams can focus on finding new applications for data to generate value for the business.
The ETL process and data warehouses
Data warehouses are the present and future of data and analytics. Storage costs on data warehouses have drastically reduced in recent years, which allows businesses to load as many raw data sources as possible without the same concerns they might have had before.
Today, data teams can ingest raw data before transforming it, allowing them to do the transformations in the warehouse instead of a separate staging area. With the increased availability of data and a common language to access that data, SQL, it allows the business more flexibility in leveraging their data to make the right decisions
The modern ETL process: Delivering better results, faster
Under the traditional ETL process, as data and processing requirements grew, the chance that on-premise data warehouses would fail grew as well. When this happened, IT would have to swoop in to fix the issue, which usually meant adding more hardware.
The modern ETL process in today’s data warehouses sidesteps this issue by offloading the compute resource management to the cloud data warehouse. Many cloud data warehouses offer compute scaling that allows for dynamic scaling when needs spike. This allows data teams to still see scalable performance while holding increased numbers of computationally expensive data models and ingesting more large data sources. The decreased cost in compute power along with compute scaling in cloud data warehouses allows data teams to efficiently scale resources up or down to suit their needs and better ensure no downtime. The bottom line is that, instead of having your in-house data and/or IT team fretting over your data storage and computing issues, you can offload that pretty much completely to the data warehouse provider.
Data teams can then build tests on top of their cloud data warehouse to monitor their data sources for quality, freshness, etc. giving them quicker, more proactive visibility into any problems with their data pipelines.
From ETL to ELT and beyond
Data, analytics, and BI have radically evolved since their inception. We are leaps and bounds beyond Excel tables and on-prem-centric data sources. Cloud-native data warehouses, cloud-native architecture in analytics and BI platforms, and embedded analytics powered by these systems have redefined what it means to be truly data-driven in our modern age.
The ETL process has been updated and can now deliver insights from a wide array of datasets, which helps companies and teams of all kinds make smarter decisions, faster. It also opens the doors for advanced analytics, next-level data monetization, and much more. Whatever you’re building with your data, a modern ELT setup will help you get more from your data, easier.
While we have seen a large migration to data warehouses, looking to the future, we might see even another evolution that involves data lakes being more widely used as the initial destination for all raw data sources, offering even further cost benefits. Additionally, new tools are starting to take form that allow for transformation within these data lakes, thus continuing the evolution of tools and processes within the data pipeline following the path of ETL to ELT.
Adam Luba is an Analytics Engineer at Sisense who boasts almost five years in the data and analytics space. He’s passionate about empowering data-driven business decisions and loves working with data across its full lifecycle.